Week 03

ASCII code table

Regular Expressions (RE or regexp)

Review:
  1. ^ and $ are beginning and end of line, respectively.
  2. Brackets [] define a set or range of characters. For example, digits are [0-9] and letters are [A-Za-z].
  3. ^ as the first character inside [] inverts the match. For example, [^0-9] matches any character that is NOT a digit.
  4. Escape with \ and ^V.
    1. \s \t \r \n
    2. Shorthand character classes (available in Vim regex and Perl-compatible RE, PCRE):
      \d and \w match [0-9] and [A-Za-z0-9_], respectively; \D and \W are their inverse matches.
      \a and \A match [A-Za-z] and [^A-Za-z] in Vim only. In PCRE, \a is the bell character and \A anchors the start of the subject.
  5. \< matches the beginning of a word and \> matches the end of a word.
  6. \{n\} repeats the preceding item exactly n times.
  7. & in a replacement stands for the text that matched the pattern.
  8. . matches any single character except \n. For example, p.p matches pip, pap, p2p or even p◻p.
  9. * means "the preceding item appears zero or more times". For example, pis* matches pi, pis, piss, pisss and so on.
    .* matches any run of characters, including an empty one.
Search patterns: /, ?, *, #, \<, \>, \c, ^ and $.
Escape characters in vim (backslash \ and Ctrl-V).

When searching:

., *, \, [, ], ^, and $ are metacharacters.
+, ?, |, {, }, (, and ) must be escaped to use their special function.
\/ is / (use backslash + forward slash to search for forward slash)
\t is tab, \s is whitespace
\n is newline, \r is CR (carriage return = Ctrl-M = ^M)
\{#\} is used for repetition. /foo.\{2\} will match foo and the two following characters. The \ is not required on the closing } so /foo.\{2} will do the same thing.
\(foo\) captures foo as a group that can be referenced later. Parentheses without escapes are matched literally. Here the \ is required for the closing \).

When replacing:

\r is newline, \n is a null byte (0x00).
\& is ampersand ( & is the text that matches the search pattern ).
\0 inserts the text matched by the entire pattern
\1 inserts the text of the first backreference. \2 inserts the second backreference, and so on.
See also
Metacharacters in ReGex
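The replacement rules above are Vim's. As a rough illustration of the same two ideas outside Vim, the sketch below uses sed, where & is the whole match and \1, \2 are captured groups. The function names and sample strings are made up for this example.

```shell
# Wrap every run of digits in square brackets; & stands for the matched text.
bracket_digits() {
    printf '%s\n' "$1" | sed 's/[0-9][0-9]*/[&]/g'
}

# Swap two colon-separated fields; \1 and \2 are the captured groups.
swap_fields() {
    printf '%s\n' "$1" | sed 's/^\([^:]*\):\([^:]*\)$/\2:\1/'
}

bracket_digits 'room 12, floor 3'   # room [12], floor [3]
swap_fields 'left:right'            # right:left
```

Note that newline handling in replacements (\r versus \n) is specific to Vim and is not shown here.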

Simple shell scripting in sh/bash.

  1. grep : Searches for lines that match a pattern. Use -i to ignore case, and -v for inverse match.
    Character classes:
    [[:alpha:]] is [A-Za-z], [[:digit:]] is [0-9], [[:upper:]] is [A-Z], [[:lower:]] is [a-z],
    [[:alnum:]] is [0-9A-Za-z], [[:blank:]] is either blank space or tab, and [[:xdigit:]] is [0-9A-Fa-f].

    [[:punct:]] are ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

    [[:graph:]] matches [:alnum:] and [:punct:]

    [^[:digit:]] matches anything except numbers, and is equivalent to [^0-9]
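A small sketch of the POSIX classes in action, run against a handful of made-up lines instead of a real file. The sample() helper and its contents are assumptions for illustration only; grep -c prints the number of matching lines.

```shell
# Hypothetical sample input, one item per line.
sample() {
    printf '%s\n' 'abc' 'ABC' '123' 'a1!' 'x y'
}

sample | grep -c '^[[:alpha:]]*$'        # lines made only of letters: 2
sample | grep -c '[[:digit:]]'           # lines containing a digit: 2
sample | grep -c '[^[:alnum:]]'          # lines with a non-alphanumeric character: 2
sample | grep -c '^[[:upper:]]\{3\}$'    # exactly three uppercase letters: 1
```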

       Repetition
           A regular expression may be  followed  by  one  of  several  repetition
           operators:
           ?      The preceding item is optional and matched at most once.
           *      The preceding item will be matched zero or more times.
           +      The preceding item will be matched one or more times.
           {n}    The preceding item is matched exactly n times.
           {n,}   The preceding item is matched n or more times.
           {,m}   The  preceding  item  is matched at most m times.  This is a GNU
                  extension.
           {n,m}  The preceding item is matched at least n  times,  but  not  more
                  than m times.
    
    The -E switch enables extended RE (ERE); grep -E is equivalent to egrep (recent GNU grep releases mark egrep as obsolescent in favour of grep -E).
    The -P switch enables PCRE, which works conveniently with shorthand classes.
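To make the difference between basic and extended syntax concrete, here is a minimal sketch on invented words. The words() helper is an assumption for this example; the point is only that interval braces need backslashes in basic RE, while ?, + and {n} work unescaped under -E.

```shell
# Hypothetical sample words, one per line.
words() {
    printf '%s\n' 'color' 'colour' 'colouur' 'clr'
}

words | grep    'colou\{0,1\}r'   # basic RE: color, colour
words | grep -E 'colou?r'         # extended RE: same two lines
words | grep -E 'colou+r'         # one or more u: colour, colouur
words | grep -E 'colou{2}r'       # exactly two u: colouur
```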

    For legacy NCTU student IDs in the format of u1234567 :
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd
    grep "[ugd][[:digit:]a][[:digit:]]\{6\}" /etc/passwd
    grep -E "[udg][0-9a][0-9]{6}" /etc/passwd
    grep -E [udg][0-9a][0-9]{6} /etc/passwd
    egrep [udg][0-9a][0-9]{6} /etc/passwd
    grep -P "[udg][0-9a]\d{6}" /etc/passwd
    grep -P "[udg][\da]\d{6}" /etc/passwd
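    All of the above commands give the same result. The unquoted forms work only as long as the shell finds no file name matching the bracket glob in the current directory; quoting the pattern is safer.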

    For new-style NYCU student IDs in the format of u123456789 :
    grep "[udg][0-9]\{9\}" /etc/passwd
    grep "[ugd][[:digit:]]\{9\}" /etc/passwd
    grep -E "[udg][0-9]{9}" /etc/passwd
    grep -E [udg][0-9]{9} /etc/passwd
    egrep [udg][0-9]{9} /etc/passwd
    grep -P "[udg]\d{9}" /etc/passwd
    All of the above commands give the same result.
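One thing worth checking with these patterns: grep reports any line that contains a match, so the shorter legacy pattern also hits lines with the longer new-style ID. The sketch below shows this on invented passwd-style lines (none of the accounts are real) and anchors the pattern to the start of the line and to the : field separator to tell the two formats apart.

```shell
# Hypothetical /etc/passwd-style lines; these accounts do not exist.
sample_passwd() {
    printf '%s\n' \
        'root:x:0:0:root:/root:/bin/bash' \
        'u9012345:x:1001:100::/home/u9012345:/bin/bash' \
        'u109123456:x:1002:100::/home/u109123456:/bin/bash' \
        'guest:x:1003:100::/home/guest:/bin/bash'
}

# Unanchored: both student lines match, so the count is 2.
sample_passwd | grep -c '[udg][0-9a][0-9]\{6\}'

# Anchored: each pattern now selects exactly one line.
sample_passwd | grep -c '^[udg][0-9a][0-9]\{6\}:'    # legacy ID only
sample_passwd | grep -c '^[udg][0-9]\{9\}:'          # new-style ID only
```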


    The -e switch can be given several times to supply multiple patterns; a line is printed if any of them matches, which acts as a logical OR.
    To match both the NCTU and the NYCU student ID patterns in /etc/passwd :
    grep -e "[udg][0-9a][0-9]\{6\}" -e "[udg][0-9]\{9\}" /etc/passwd
    grep -e "[ugd][[:digit:]a][[:digit:]]\{6\}" -e "[ugd][[:digit:]]\{9\}" /etc/passwd
    grep -E -e "[udg][0-9a][0-9]{6}" -e "[udg][0-9]{9}" /etc/passwd
    grep -E -e [udg][0-9a][0-9]{6} -e [udg][0-9]{9} /etc/passwd
    egrep -e [udg][0-9a][0-9]{6} -e [udg][0-9]{9} /etc/passwd
    grep -P -e "[udg][0-9a]\d{6}" -e "[udg]\d{9}" /etc/passwd

    or, in this special case, combine the two patterns into one with grouped alternation (\| in a basic RE is a GNU extension):
    grep "[udg][0-9a]\([0-9]\{6\}\|[0-9]\{8\}\)" /etc/passwd
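As a hedged check that several -e patterns and a single alternation behave the same, the sketch below runs both forms on invented IDs. The helper names and the sample IDs are assumptions; the patterns are anchored with ^ and $ here because each sample line holds only an ID.

```shell
# Hypothetical IDs, one per line: two legacy, one new-style, two non-matching.
ids() {
    printf '%s\n' 'u9012345' 'u109123456' 'ga123456' 'x1234567' 'u12'
}

# Two basic-RE patterns joined by -e (logical OR).
match_with_e() {
    grep -e '^[udg][0-9a][0-9]\{6\}$' -e '^[udg][0-9]\{9\}$'
}

# One extended-RE pattern using grouped alternation.
match_with_alternation() {
    grep -E '^[udg]([0-9a][0-9]{6}|[0-9]{9})$'
}

ids | match_with_e              # u9012345, u109123456, ga123456
ids | match_with_alternation    # the same three lines
```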

  2. sed : stream editor. The script is usually enclosed in a pair of single quotes '  ', typically something like 's/PATTERN/REPLACEMENT/g'
    Use -e to chain multiple scripts.
    Practice: Write a one-line command that converts /etc/passwd into an HTML file, using sed.
    grep -E -e '^[udg][0-9a][0-9]{6}' -e '^[udg][0-9]{9}' /etc/passwd | sed -e 's/^/<a href="http:\/\/ukko.life.nctu.edu.tw\/~/' -e 's/:x.*s\//">/' -e 's/:\/bin.*$/<\/a><br>/'
    Question: What if the script itself has to contain a single quote?
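Two sketches related to the practice and the question, not the intended solution. The first uses | as the sed delimiter so the slashes in the URL need no escaping, and captures the login name with \( \). The second shows one common way to get a single quote into a single-quoted script: close the quotes, add an escaped quote, reopen. The base URL example.org and the function names are placeholders.

```shell
# Turn "login:..." lines into HTML links. The host name is a placeholder.
to_links() {
    sed -e 's|^\([^:]*\):.*$|<a href="http://example.org/~\1">\1</a><br>|'
}

# Surround each input line with single quotes: '\'' ends the quoted part,
# adds a literal quote, and starts a new quoted part.
quote_line() {
    sed 's/^.*$/'\''&'\''/'
}

printf '%s\n' 'u9012345:x:1001:100::/home/u9012345:/bin/bash' | to_links
printf '%s\n' 'hello' | quote_line    # 'hello'
```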

  3. wc : word count
    wc /etc/profile
    28 99 607 /etc/profile means the file has 28 lines, 99 words and 607 bytes.
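The three numbers can also be requested one at a time. A minimal sketch on a known piece of text (the sample_text helper is made up for this example):

```shell
# Two lines, three words, 14 bytes (both newlines included).
sample_text() {
    printf 'one two\nthree\n'
}

sample_text | wc -l    # line count: 2
sample_text | wc -w    # word count: 3
sample_text | wc -c    # byte count: 14
sample_text | wc -m    # character count; can differ from -c for multibyte text
```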

  4. man : Check manual of the commands.
  5. bc : basic calculator , online manual
    echo "3+5" | bc
    echo "3-5" | bc
    echo "3*5" | bc
    echo "3%5" | bc
    echo "3/5" | bc -l
    echo "scale=8; 3/5" | bc -l
    echo "scale=8; l(2.71828)" | bc -l # calculate ln(2.71828)
    echo "scale=20; 4*a(1)" | bc -l    # calculate π as 4×arctan(1) to 20 decimal places
    bc -l <<< "scale=20; 4*a(1)"        # a here-string (<<<) feeds the word that follows it to stdin
    bc -l <<< echo "scale=20; 4*a(1)"   # incorrect: the here-string is only the word echo, and the quoted text becomes a file-name argument of bc
    bc -l <<< `echo "scale=20; 4*a(1)"` # backquotes ` ` or $( ) substitute the stdout of a command as a string
    bc -l <<< $(echo "scale=20; 4*a(1)")   # $( ) is preferred over backquotes ` `
    
    Useful resource: http://x-bc.sourceforge.net .
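Since the scale setting is repeated in most of the commands above, it can be wrapped in a small function. This is only a sketch: calc is a made-up name, and it uses a pipe instead of a here-string so that it also runs in plain sh.

```shell
# calc SCALE EXPRESSION: evaluate EXPRESSION with bc -l at the given scale.
calc() {
    printf 'scale=%s; %s\n' "$1" "$2" | bc -l
}
```

For example, calc 8 '3/5' prints .60000000 and calc 0 '2^10' prints 1024.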

  6. formatting output in bash with awk
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3 $4}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3, $4}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $4, $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | cut -d: -f3,4
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | cut -d: -f4,3   # cut keeps the file's field order, so this prints the same as -f3,4
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3 " " $4}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "\n" $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%d", $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%d\n", $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%4d\n", $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%5d\n", $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d\n", $3}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d,%05d\n", $3, $4}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d|%05d\n", $3, $4}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d\\%05d\n", $3, $4}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d%%%05d\n", $3, $4}'
    grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d\x27%05d\n", $3, $4}'
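The printf variations above can be tried without touching /etc/passwd. The sketch below formats two invented records; the field layout login:password:UID:GID and the function names are assumptions for illustration.

```shell
# Hypothetical passwd-style records: login:password:UID:GID
records() {
    printf '%s\n' 'u9012345:x:1001:100' 'd8812345:x:42:7'
}

# %-10s left-aligns the login in 10 columns, %05d zero-pads to 5 digits,
# and %5d right-aligns in 5 columns.
format_records() {
    awk -F: '{ printf "%-10s uid=%05d gid=%5d\n", $1, $3, $4 }'
}

records | format_records
# u9012345   uid=01001 gid=  100
# d8812345   uid=00042 gid=    7
```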
    
  7. More about single quotes and double quotes.
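A short sketch of the two differences that matter most in practice, with made-up variable names: single quotes keep the text literal, double quotes allow variable expansion, and an unquoted variable is split into separate words.

```shell
name='world'
single='hello $name'      # single quotes: $name stays literal
double="hello $name"      # double quotes: $name is expanded

# count_args: print how many arguments the function received.
count_args() {
    echo "$#"
}

list='a b c'
printf '%s\n' "$single"   # hello $name
printf '%s\n' "$double"   # hello world
count_args $list          # unquoted: split into 3 words
count_args "$list"        # quoted: passed as 1 word
```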

  8. Sample bash code for the sum of the finite series 1+2+3+4+5+...+n , using a (naive) loop:
    #!/bin/bash
    counter=$1                 # comments start with a hash sign (#)
    sum=0                      # initial value of $sum
    while [ "$counter" -gt 0 ] # -gt means greater than
    do
       sum=$(( sum + counter ))     # $(( )) performs integer math in bash
       counter=$(( counter - 1 ))
    done
    echo "$sum"
    # end of code
    

    If you save the bash script above as ~/sum.bash , do
    chmod +x ~/sum.bash
    before you can calculate 1+2+3+4+5+...+10 by
    ~/sum.bash 10

    Question: How to make the code run faster?
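One possible answer, sketched under the assumption that only the final total is needed: replace the loop with the closed form n(n+1)/2, which takes a single arithmetic expansion however large n is. Both versions are written as functions here (the names are made up) so they can be compared side by side.

```shell
# sum_loop N: add 1..N one term at a time, like the script above.
sum_loop() {
    counter=$1
    sum=0
    while [ "$counter" -gt 0 ]; do
        sum=$(( sum + counter ))
        counter=$(( counter - 1 ))
    done
    echo "$sum"
}

# sum_formula N: closed form N(N+1)/2, no loop.
sum_formula() {
    echo $(( $1 * ($1 + 1) / 2 ))
}

sum_loop 10       # 55
sum_formula 10    # 55
```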

    Practice: Write a bash script to calculate the factorial n! .

  9. seq : sequence generator for integers
    seq 1 10               # print integers 1 to 10, one integer per line
    seq -s " " 1 10        # print integers 1 to 10 in one single line separated by one space
    seq -s " " 1 2 11      # 1 3 5 7 9 11
    seq -s " " 10 -1 1     # 10 9 8 7 6 5 4 3 2 1
    seq 1 10 | tr -d "\n"  # 12345678910 without newline
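seq is often used to drive a for loop. A minimal sketch, with a made-up function name, that adds the odd numbers up to a limit by using the step argument shown above:

```shell
# sum_odds LAST: add the odd numbers from 1 up to LAST.
sum_odds() {
    total=0
    for i in $(seq 1 2 "$1"); do
        total=$(( total + i ))
    done
    echo "$total"
}

sum_odds 11    # 1+3+5+7+9+11 = 36
```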
       

  10. Special variables with dollar sign prefix in shells
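A few of the most common ones, shown as a sketch with made-up function names: $# is the number of arguments, $1 the first argument, $* all arguments, $? the exit status of the last command and $$ the process ID of the current shell.

```shell
# show_args: report the argument-related parameters seen inside a function.
show_args() {
    echo "count=$# first=$1 all=$*"
}

# status_of COMMAND...: run COMMAND quietly and print its exit status ($?).
status_of() {
    if "$@" >/dev/null 2>&1; then echo 0; else echo "$?"; fi
}

show_args a b c     # count=3 first=a all=a b c
status_of true      # 0
status_of false     # 1
echo "pid=$$"       # process ID of this shell; the value varies per run
```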

Example files:
Under the directory of ~jsyu/Example

-rw-r--r-- 1 jsyu users   270 Mar 12 12:45 h2o_freq.com
-rw-r--r-- 1 jsyu users 23114 Mar 12 12:46 h2o_freq.log
-rw-r--r-- 1 jsyu users   290 Mar 12 12:43 h2o_xyz.gjf
-rw-r--r-- 1 jsyu users 30632 Mar 12 12:43 h2o_xyz.log

Copy them back into your home directory.
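/usr/bin/iconv
Main options:
-f   encoding of the source text
-t   encoding of the output text
-l   list the known character sets

Example: iconv -f big5  -t utf8  test.big5.txt  > test.utf8.txt
big5-->utf8: this command converts the Big5-encoded file test.big5.txt
to UTF-8 and writes the result to test.utf8.txt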
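The same conversion can be wrapped in a small helper when several files need it. This is a sketch only; to_utf8 is a made-up name, and the encoding names accepted depend on what iconv -l lists on the machine.

```shell
# to_utf8 ENCODING FILE: print FILE converted from ENCODING to UTF-8.
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}
```

For example, to_utf8 BIG5 test.big5.txt > test.utf8.txt does the same as the command above.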