Week 03

ASCII code table

Regular Expressions (RE or regexp)

Review:

^ and $ are beginning and end of line, respectively.
Brackets [] define a character set or range. For example, digits are [0-9] and letters are [A-Za-z].
^ immediately after the opening [ inverts the match. For example, [^0-9] matches anything that is NOT a digit.
Escape with \ and ^V.
1. \s \t \r \n
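2. Shorthand character classes (not part of POSIX basic RE; support differs between vimregex and Perl compatible RE, PCRE ):
  \d and \w match [0-9] and [A-Za-z0-9_], respectively, in both vimregex and PCRE. \a matches [A-Za-z] in vimregex only; in PCRE, \a is the bell character.
  \D and \W are their inverse matches. \A is the inverse of \a in vimregex only; in PCRE, \A anchors the start of the subject string.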
\< and \> match the beginning and the end of a word, respectively.
\{n\} repeats the preceding item exactly n times.
& stands for the text that the pattern matched.
. denotes anything except \n, for example, p.p matches pip, pap, p2p or even p◻p.
* means "zero or more of the preceding item". For example, pis* matches pi, pis, piss, pisss, and so on.
.* matches any sequence of characters, including an empty one.
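
The review items above can be tried outside vim with grep. This is a minimal sketch on a made-up word list (the sample words and variable names are assumptions, not course data):

```shell
# Hypothetical sample lines, one word per line.
words=$(printf '%s\n' pip pap p2p pp pi pis piss 42 a1)

# ^ and $ anchor the pattern to the whole line; . matches exactly one character.
dot_matches=$(printf '%s\n' "$words" | grep '^p.p$')

# s* allows zero or more s after "pi".
star_matches=$(printf '%s\n' "$words" | grep '^pis*$')

# [^0-9] matches one non-digit, so lines made only of digits are not counted.
non_digit=$(printf '%s\n' "$words" | grep -c '[^0-9]')

echo "$dot_matches"    # pip, pap and p2p (pp has no middle character)
echo "$star_matches"   # pi, pis and piss
echo "$non_digit"      # 8 (every line except 42)
```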

Search patterns: /, ?, *, #, \<, \>, \c, ^ and $.
Escape characters in vim (backslash \ and Ctrl-V).

When searching:

., *, \, [, ], ^, and $ are metacharacters.

+, ?, |, {, }, (, and ) must be escaped to use their special function.

\/ is / (use backslash + forward slash to search for forward slash)

\t is tab, \s is whitespace

\n is newline, \r is CR (carriage return = Ctrl-M = ^M)

\{#\} is used for repetition. /foo.\{2\} will match foo and the two following characters. The \ is not required on the closing } so /foo.\{2} will do the same thing.

\(foo\) makes a backreference to foo. Parentheses without escapes are matched literally. Here the \ is required for the closing \).

When replacing:

\r is newline, \n is a null byte (0x00).

\& is ampersand ( & is the text that matches the search pattern ).

\0 inserts the text matched by the entire pattern

\1 inserts the text of the first backreference. \2 inserts the second backreference, and so on.
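
The replacement rules for & and \1 can also be checked non-interactively. The sketch below uses sed as a stand-in for vim's :s command, on the assumption that the two behave alike for these items; the vim-specific \r and \n conventions are not covered here.

```shell
# & in the replacement stands for the whole match.
wrapped=$(printf 'foo\n' | sed 's/o/[&]/g')

# \& is a literal ampersand.
literal=$(printf 'foo\n' | sed 's/o/\&/g')

# \( \) captures groups; \1 and \2 insert the first and second group.
swapped=$(printf 'foo bar\n' | sed 's/\(foo\) \(bar\)/\2 \1/')

echo "$wrapped"   # f[o][o]
echo "$literal"   # f&&
echo "$swapped"   # bar foo
```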

Simple shell scripting in sh/bash.

grep : Searches for lines that match a pattern. Use -i to ignore upper-/lower-case differences, and -v for inverse match.
Character Class:
[[:alpha:]] is [A-Za-z], [[:digit:]] is [0-9], [[:upper:]] is [A-Z], [[:lower:]] is [a-z],
[[:alnum:]] is [0-9A-Za-z], [[:blank:]] is either blank space or tab, and [[:xdigit:]] is [0-9A-Fa-f].

[[:punct:]] are ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

[[:graph:]] matches [:alnum:] and [:punct:]

[^[:digit:]] matches anything except numbers, and is equivalent to [^0-9]
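
A short sketch of the character classes above, run against a few invented lines (the sample data is an assumption chosen so that each class selects something different):

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'abc' '123' 'A1b2' 'x y' '!?')

alpha_only=$(printf '%s\n' "$sample" | grep '^[[:alpha:]]*$')      # letters only
digit_only=$(printf '%s\n' "$sample" | grep '^[[:digit:]]*$')      # digits only
alnum_count=$(printf '%s\n' "$sample" | grep -c '^[[:alnum:]]*$')  # letters and/or digits
with_blank=$(printf '%s\n' "$sample" | grep '[[:blank:]]')         # contains a space or tab
punct_only=$(printf '%s\n' "$sample" | grep '^[[:punct:]]*$')      # punctuation only
non_digit=$(printf '%s\n' "$sample" | grep -c '[^[:digit:]]')      # has at least one non-digit

echo "$alpha_only $digit_only $alnum_count $with_blank $punct_only $non_digit"
```

Note the double brackets: [:alpha:] is only valid inside a bracket expression, so the complete pattern is [[:alpha:]].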
```
   Repetition
       A regular expression may be  followed  by  one  of  several  repetition
       operators:
       ?      The preceding item is optional and matched at most once.
       *      The preceding item will be matched zero or more times.
       +      The preceding item will be matched one or more times.
       {n}    The preceding item is matched exactly n times.
       {n,}   The preceding item is matched n or more times.
       {,m}   The  preceding  item  is matched at most m times.  This is a GNU
              extension.
       {n,m}  The preceding item is matched at least n  times,  but  not  more
              than m times.
```
The -E switch enables extended RE (ERE), which is equivalent to egrep.
The -P switch enables PCRE, which works conveniently with shorthand classes.

For legacy NCTU student IDs in the format of u1234567 :
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd
grep "[ugd][[:digit:]a][[:digit:]]\{6\}" /etc/passwd
grep -E "[udg][0-9a][0-9]{6}" /etc/passwd
grep -E [udg][0-9a][0-9]{6} /etc/passwd
egrep [udg][0-9a][0-9]{6} /etc/passwd
grep -P "[udg][0-9a]\d{6}" /etc/passwd
grep -P "[udg][\da]\d{6}" /etc/passwd
All of the above commands give the same result.

For new-style NYCU student IDs in the format of u123456789 :
grep "[udg][0-9]\{9\}" /etc/passwd
grep "[ugd][[:digit:]]\{9\}" /etc/passwd
grep -E "[udg][0-9]{9}" /etc/passwd
grep -E [udg][0-9]{9} /etc/passwd
egrep [udg][0-9]{9} /etc/passwd
grep -P "[udg]\d{9}" /etc/passwd
All of the above commands give the same result.
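
To experiment without touching the real /etc/passwd, the patterns can be run on a few invented passwd-style lines. Everything in passwd_sample below is hypothetical. The sketch checks that the basic, POSIX-class and extended spellings select the same lines. It also shows that the unanchored legacy pattern selects the new-style account too, because u123456789 contains u1234567.

```shell
# Hypothetical passwd-style lines (not real accounts).
passwd_sample='u1234567:x:1001:100:Old Student:/home/u1234567:/bin/bash
da123456:x:1002:100:Old Grad:/home/da123456:/bin/bash
u123456789:x:1003:100:New Student:/home/u123456789:/bin/bash
root:x:0:0:root:/root:/bin/bash'

# Three spellings of the legacy pattern: basic RE, POSIX classes, extended RE.
bre=$(printf '%s\n' "$passwd_sample" | grep '[udg][0-9a][0-9]\{6\}')
posix=$(printf '%s\n' "$passwd_sample" | grep '[ugd][[:digit:]a][[:digit:]]\{6\}')
ere=$(printf '%s\n' "$passwd_sample" | grep -E '[udg][0-9a][0-9]{6}')

# Unanchored, the legacy pattern also hits the new-style line (3 lines in total).
legacy_count=$(printf '%s\n' "$bre" | grep -c '')

# Anchoring to the whole first field keeps only the two legacy accounts.
legacy_only=$(printf '%s\n' "$passwd_sample" | grep -Ec '^[udg][0-9a][0-9]{6}:')

# New-style IDs: one letter followed by nine digits.
new_style=$(printf '%s\n' "$passwd_sample" | grep -E '[udg][0-9]{9}' | cut -d: -f1)

echo "$legacy_count $legacy_only $new_style"
```

Quoting the pattern, as done here, also keeps the shell from treating the brackets as a filename glob.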

The -e switch can be repeated to give multiple patterns. A line is selected if it matches any one of them, so this acts as a logical OR.
To match both the NCTU and NYCU student ID patterns in /etc/passwd :
grep -e "[udg][0-9a][0-9]\{6\}" -e "[udg][0-9]\{9\}" /etc/passwd
grep -e "[ugd][[:digit:]a][[:digit:]]\{6\}" -e "[ugd][[:digit:]]\{9\}" /etc/passwd
grep -E -e "[udg][0-9a][0-9]{6}" -e "[udg][0-9]{9}" /etc/passwd
grep -E -e [udg][0-9a][0-9]{6} -e [udg][0-9]{9} /etc/passwd
egrep -e [udg][0-9a][0-9]{6} -e [udg][0-9]{9} /etc/passwd
grep -P -e "[udg][0-9a]\d{6}" -e "[udg]\d{9}" /etc/passwd
or, in this special case, combine the two patterns together:
grep -E "[udg]([0-9a][0-9]{6}|[0-9]{9})" /etc/passwd
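
A small check that repeated -e patterns and a single alternation select the same lines. The ID list is invented for illustration, and the patterns are anchored with ^ and $ so that each line must be exactly one ID:

```shell
# Hypothetical login names, one per line.
ids='u1234567
da123456
u123456789
root
u12345'

# Two basic-RE patterns joined by -e (logical OR).
two_patterns=$(printf '%s\n' "$ids" | grep -e '^[udg][0-9a][0-9]\{6\}$' -e '^[udg][0-9]\{9\}$')

# One extended-RE pattern using ( | ) alternation.
alternation=$(printf '%s\n' "$ids" | grep -E '^[udg]([0-9a][0-9]{6}|[0-9]{9})$')

echo "$two_patterns"   # u1234567, da123456 and u123456789
```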
sed : stream editor. Commands are enclosed within a pair of single quotes ' ', usually something like 's/PATTERN/REPLACEMENT/g'
Use -e to chain multiple commands.
Practice: Write a one-line command that converts /etc/passwd into an HTML file, using sed.
grep -E -e '^[udg][0-9a][0-9]{6}' -e '^[udg][0-9]{9}' /etc/passwd | sed -e 's/^/<a href=\"http:\/\/ukko\.life\.nctu\.edu\.tw\/~/' -e 's/:x.*s\//\">/' -e 's/:\/bin.*$/<\/a><br>/'
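
The three sed steps can be traced on a single invented line. The sketch below uses | as the s command delimiter so that the slashes in the URL need no escaping. The account line is hypothetical, and its home directory is chosen to contain a component ending in "s/", which the second expression relies on.

```shell
# Hypothetical passwd line; the home directory layout is an assumption.
line='u1234567:x:1001:100:Old Student:/home/students/u1234567:/bin/bash'

html=$(printf '%s\n' "$line" | sed \
   -e 's|^|<a href="http://ukko.life.nctu.edu.tw/~|' \
   -e 's|:x.*s/|">|' \
   -e 's|:/bin.*$|</a><br>|')

echo "$html"
# <a href="http://ukko.life.nctu.edu.tw/~u1234567">u1234567</a><br>
```

The first expression prepends the link prefix, the second replaces everything from ":x" up to the last "s/" with the closing of the href attribute, and the third turns the trailing shell path into the closing tags.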
Question: What if a single quote is part of the pattern?
wc : Word counter
wc /etc/profile
28 99 607 /etc/profile means the file has 28 lines, 99 words and 607 bytes.
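
The three numbers can also be requested one at a time with -l, -w and -c. A minimal sketch on an inline string (the text is made up); tr -d ' ' strips the padding that some wc implementations add:

```shell
# Hypothetical two-line text.
text='one two
three'

lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')   # newline count
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')   # whitespace-separated words
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')   # bytes, including both newlines

echo "$lines $words $bytes"   # 2 3 14
```

For text with multi-byte characters, wc -m counts characters instead of bytes.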
man : Check manual of the commands.

bc : basic calculator

echo "3+5" | bc
echo "3-5" | bc
echo "3*5" | bc
echo "3%5" | bc
echo "3/5" | bc -l
echo "scale=8; 3/5" | bc -l
echo "scale=8; l(2.71828)" | bc -l # calculate ln(2.71828)
echo "scale=20; 4*a(1)" | bc -l    # calculate π as 4×arctan(1) to 20 decimal places
bc -l <<< "scale=20; 4*a(1)"        # a here-string feeds the string to stdin
bc -l <<< echo "scale=20; 4*a(1)"   # incorrect: <<< takes only the next word (echo) and does not run a command
bc -l <<< `echo "scale=20; 4*a(1)"` # backquotes ` ` or $( ) turn a command's stdout into a string
bc -l <<< $(echo "scale=20; 4*a(1)")   # $( ) is preferred over backquotes ` `
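
A short sketch of how scale changes the result, capturing bc output with $( ). It assumes bc is installed; the variable names are only illustrative.

```shell
# Without -l the default scale is 0, so division truncates to an integer.
int_div=$(echo "7/2" | bc)

# % gives the remainder when scale is 0.
remainder=$(echo "7%2" | bc)

# scale sets the number of digits after the decimal point.
scaled=$(echo "scale=3; 7/2" | bc)

# -l loads the math library, which provides a() for arctangent.
pi=$(echo "scale=20; 4*a(1)" | bc -l)

echo "$int_div $remainder $scaled $pi"
```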

Useful resource: http://x-bc.sourceforge.net .

Formatting output in bash with awk

grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3 $4}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3, $4}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $4, $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | cut -d: -f3,4
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | cut -d: -f4,3   # cut keeps the input field order, so this prints the same as -f3,4
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{print $3 " " $4}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "\n" $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%d", $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%d\n", $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%4d\n", $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%5d\n", $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d\n", $3}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d,%05d\n", $3, $4}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d|%05d\n", $3, $4}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d\\%05d\n", $3, $4}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d%%%05d\n", $3, $4}'
grep "[udg][0-9a][0-9]\{6\}" /etc/passwd | awk -F: '{printf "%05d\x27%05d\n", $3, $4}'
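
To see what each variant prints without reading /etc/passwd, a few of the commands above can be run on one invented line (the account below is hypothetical, with UID 1001 and GID 100):

```shell
# Hypothetical passwd line.
line='u1234567:x:1001:100:Old Student:/home/u1234567:/bin/bash'

joined=$(printf '%s\n' "$line" | awk -F: '{print $3 $4}')      # no comma: fields are concatenated
spaced=$(printf '%s\n' "$line" | awk -F: '{print $3, $4}')     # comma: fields separated by a space
reversed=$(printf '%s\n' "$line" | awk -F: '{print $4, $3}')   # awk can reorder fields
cut_fields=$(printf '%s\n' "$line" | cut -d: -f4,3)            # cut cannot: input order is kept
padded=$(printf '%s\n' "$line" | awk -F: '{printf "%05d|%05d\n", $3, $4}')   # zero-padded to 5 digits

echo "$joined"       # 1001100
echo "$spaced"       # 1001 100
echo "$reversed"     # 100 1001
echo "$cut_fields"   # 1001:100
echo "$padded"       # 01001|00100
```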

More about single quotes and double quotes.
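
A minimal sketch of the difference, with made-up variable names. Inside single quotes nothing is expanded; inside double quotes the shell expands $name before the command sees it. This is why awk and sed programs are normally single-quoted. The last two lines show two common ways to get a literal single quote into a sed command.

```shell
name="world"

single='hello $name'    # stays literally: hello $name
double="hello $name"    # expanded by the shell: hello world

# Single quotes protect awk's $3 from the shell.
field=$(printf 'a:b:c\n' | awk -F: '{print $3}')

# A single quote cannot appear inside single quotes:
# close the quote, add an escaped quote, and reopen.
fixed_a=$(printf "it's\n" | sed 's/'\''/"/')

# Alternatively, double-quote the sed command and escape the inner double quote.
fixed_b=$(printf "it's\n" | sed "s/'/\"/")

echo "$single | $double | $field | $fixed_a | $fixed_b"
```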

Sample bash code for the sum of the finite series 1+2+3+4+5+...+n , using a (stupid) loop:

#!/bin/bash
counter=$1               # comments start with a hash sign (#)
sum=0                    # initial value of $sum
while [ "$counter" -gt 0 ] # -gt means greater than
do
   sum=$(( $sum + $counter ))   # $(( )) performs integer math in bash
   counter=$(( $counter - 1 ))
done
echo "$sum"
# end of code

If you save the bash script above as ~/sum.bash , make it executable with
chmod +x ~/sum.bash
and then calculate 1+2+3+4+5+...+10 by
~/sum.bash 10

Question: How to make the code run faster?
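
One possible answer, offered as a sketch rather than the intended solution: replace the loop with the closed form n(n+1)/2, which takes a single arithmetic step regardless of n. The function name sum_to is hypothetical.

```shell
# Sum 1+2+...+n using the closed form n*(n+1)/2 instead of a loop.
sum_to() {
   n=$1
   echo $(( n * (n + 1) / 2 ))
}

sum_to 10    # 55
sum_to 100   # 5050
```

Because n and n+1 are consecutive integers, their product is always even, so the integer division by 2 is exact.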

Practice: Write a bash script to calculate the factorial n! .

seq : sequence generator for integers

seq 1 10               # print integers 1 to 10, one integer per line
seq -s " " 1 10        # print integers 1 to 10 in one single line separated by one space
seq -s " " 1 2 11      # 1 3 5 7 9 11
seq -s " " 10 -1 1     # 10 9 8 7 6 5 4 3 2 1
seq 1 10 | tr -d "\n"  # 12345678910 without newline
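
Since -s accepts any separator, seq can also build an arithmetic expression for the shell to evaluate. A small sketch, assuming GNU seq (other implementations may treat -s differently):

```shell
# Even numbers from 2 to 10 on one line.
evens=$(seq -s " " 2 2 10)

# seq -s + prints 1+2+...+10, which $(( )) then evaluates.
total=$(( $(seq -s + 1 10) ))

echo "$evens"   # 2 4 6 8 10
echo "$total"   # 55
```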

Special variables with dollar sign prefix in shells
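
A brief sketch of a few commonly used special variables. Which ones the course covers is not stated here, so this selection is an assumption. set -- is used only to simulate command-line arguments inside the example.

```shell
set -- alpha beta gamma    # simulate three command-line arguments

arg_count=$#               # number of positional parameters
first_arg=$1               # first positional parameter
all_args="$*"              # all parameters joined into one string

# $? holds the exit status of the last command; grep returns 1 when nothing matches.
status=0
grep -q 'x' /dev/null || status=$?

shell_pid=$$               # process ID of the current shell

echo "$arg_count $first_arg ($all_args) $status $shell_pid"
```

In a script, $0 additionally holds the name the script was invoked with, and "$@" expands to the arguments as separate words.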

Example files:
Under the directory of ~jsyu/Example

-rw-r--r-- 1 jsyu users   270 Mar 12 12:45 h2o_freq.com
-rw-r--r-- 1 jsyu users 23114 Mar 12 12:46 h2o_freq.log
-rw-r--r-- 1 jsyu users   290 Mar 12 12:43 h2o_xyz.gjf
-rw-r--r-- 1 jsyu users 30632 Mar 12 12:43 h2o_xyz.log

Copy them back into your home directory.

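/usr/bin/iconv
Main options:
-f   encoding of the input text
-t   encoding of the output text
-l   list the known character encodings

Example: iconv -f big5  -t utf8  test.big5.txt  > test.utf8.txt
big5-->utf8: this command converts the Big5-encoded file test.big5.txt
to UTF-8 and writes the result to test.utf8.txt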
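
A round-trip sketch that needs no input file. It assumes the script itself is saved as UTF-8 and that the local iconv knows the names UTF-8 and BIG5 (check with iconv -l). The sample string is arbitrary.

```shell
original='中文'

# UTF-8 -> Big5 -> UTF-8 should give back the same text.
roundtrip=$(printf '%s' "$original" | iconv -f UTF-8 -t BIG5 | iconv -f BIG5 -t UTF-8)

# Each of these two characters takes 2 bytes in Big5 (3 bytes each in UTF-8).
big5_bytes=$(printf '%s' "$original" | iconv -f UTF-8 -t BIG5 | wc -c | tr -d ' ')

echo "$roundtrip $big5_bytes"
```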