Review of Unix commands, plus uniq and sort:

Week 06

List of Unix commands defined by IEEE Std 1003.1-2008.

Review: Linux commands taught up to this week:

Change password: passwd
About user status: w, who, last, finger, top
File and directory operations: ls, mkdir, cd, rmdir, rm, cp, mv, cat, chmod, chown, chgrp
Numerical representation of the file permission using octal digits; sticky bit and suid bit.
File type: file, ldd
Checksum: cksum, md5sum, shasum, sha256sum, sha512sum
Compress: gzip, bzip2, xz, zstd
Packing files: tar, and its combination with compression
*Filesystem: mount, df, du
Standard output paging: more, less
Pattern matching: grep with -e, -i, -v, -c, -A, -B and -E
Line operations of files: wc, head, tail
Common and difference: diff, comm
Manual: man and info
*System Information: uname, uptime, dmesg, lspci, lsmod
*Network Related services: ssh, sftp, telnet, ftp, wget, lynx
Setting environment variables: set, setenv, echo
- Environment variables: $PATH, $path, $LANG, $TERM
- Shell startup files: .profile and .tcshrc
Basic calculator: bc
X-windows applications: xterm, gnome-terminal, xeyes, xclock

Formatting output in `bash`:

Find unique data record using `uniq` :

cat ~jsyu/202109-10.log | uniq         # useless
cat ~jsyu/202109-10.log | uniq -c      # print counts
cat ~jsyu/202109-10.log | uniq -u      # only print unique lines
awk -F" " '{print $1}' ~jsyu/202109-10.log | uniq -c

Other frequently used options of uniq : -i, -s and -f
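
A minimal sketch of how these options behave, using made-up user names instead of the real log (the names, the time stamps and the helper function names() are assumptions for illustration only). uniq compares only adjacent lines, which is why its input is usually sorted first:

```shell
# Hypothetical toy data: one user name per line, in login order.
names() { printf '%s\n' jsyu u0817001 jsyu JSYU jsyu; }

# uniq compares only ADJACENT lines, so unsorted input is not collapsed here.
unsorted_groups=$(names | uniq | awk 'END { print NR }')

# Sorting first makes equal lines adjacent; LC_ALL=C fixes the sort order.
sorted_counts=$(names | LC_ALL=C sort | uniq -c | awk '{ print $2 "=" $1 }' | paste -sd' ' -)

# -i ignores case when comparing, so JSYU and jsyu fall into one group.
nocase_counts=$(names | LC_ALL=C sort | uniq -ci | awk '{ print $2 "=" $1 }' | paste -sd' ' -)

# -f 1 skips the first field (here a made-up time stamp) before comparing.
skip_field=$(printf '%s\n' '09:00 jsyu' '09:05 jsyu' '09:10 ta003' | uniq -f 1 -c | awk '{ print $3 "=" $1 }' | paste -sd' ' -)

echo "$unsorted_groups"
echo "$sorted_counts"
echo "$nocase_counts"
echo "$skip_field"
```

With -i or -f, uniq prints the first line of each group, so the group label is whichever line happened to come first.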

Sorting data using `sort` :

Frequently used options: -t, -k and -g

Sort /etc/passwd according to $USER and $UID.

cat /etc/passwd | sort                   # sort data according to the ASCII code
cat /etc/passwd | sort -t : -k 1         # -t specifies the field separator, -k gives the location
cat /etc/passwd | sort -t : -k 3 -g      # -g activates the numerical sort
egrep '\<^[udg][0-9a][0-9]{6}\>' /etc/passwd | sort -t : -k 3 -g

You can temporarily set the environment variable LC_ALL=C when running sort so that the ordering follows plain byte values instead of the collation rules of the current locale, for example:
LC_ALL=C sort -t : -k 5 /etc/passwd
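
To see what -t, -k and -g change, here is a small sketch on four made-up passwd-style records (the records and the helper name sample() are assumptions, not taken from the real /etc/passwd):

```shell
# Made-up passwd-style records: name:password:UID:GID
sample() { printf '%s\n' 'jsyu:x:1000:100' 'ta003:x:90:100' 'u0817001:x:2500:100' 'root:x:0:0'; }

# Character-by-character comparison of field 3: "1000" sorts before "90".
by_text=$(sample | LC_ALL=C sort -t : -k 3,3 | cut -d : -f 1 | paste -sd' ' -)

# -g compares field 3 as numbers instead.
by_number=$(sample | LC_ALL=C sort -t : -k 3,3 -g | cut -d : -f 1 | paste -sd' ' -)

echo "text order:    $by_text"
echo "numeric order: $by_number"
```

Writing the key as -k 3,3 ends it at field 3; a bare -k 3 extends the key from field 3 to the end of the line. The -g option is an extension found in GNU sort; the POSIX option -n is enough for plain integers such as UIDs.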

What is the difference between the two commands below? Why?
- awk -F" " '{print $1}' ~jsyu/202109-10.log | uniq -c


 awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c


     39 d0887201
     62 d410351001
     35 g309351018
      4 g310351020
      6 g310352017
     55 jsyu
     62 m309351015
      3 ta003
      2 tachem
     59 u0617087
     36 u0717021
     18 u0717022
     36 u0717032
      6 u0717033
     24 u0717075
     67 u0817001
     48 u0817018
     38 u0817037
      2 u0817109


 wc -l ~jsyu/202109-10.log

602


echo $(awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c | awk -F" " '{printf $1"+"}' | sed -e 's/+$//' ) | bc


bc <<< $(awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c | awk -F" " '{printf $1"+"}' | sed -e 's/+$//' )

602
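
The same consistency check can be sketched without bc, since the counts are integers. The three count lines below are copied from the table above purely as sample input; the helper name counts() is hypothetical:

```shell
# "uniq -c"-style sample input: a count followed by a user name.
counts() { printf '%7d %s\n' 39 d0887201 62 d410351001 35 g309351018; }

# Join the first column with "+" to get one arithmetic expression.
expr_str=$(counts | awk '{ print $1 }' | paste -sd+ -)

# Evaluate it with shell integer arithmetic.
total_shell=$(( $expr_str ))

# Or let awk accumulate the sum directly.
total_awk=$(counts | awk '{ sum += $1 } END { print sum }')

echo "$expr_str = $total_shell = $total_awk"
```

The awk version is probably the simplest choice when only the total is needed, because it avoids building the intermediate string.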


awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c | sort -t " " -k 1 -g  -r     # -r activates reverse sort


     67 u0817001
     62 m309351015
     62 d410351001
     59 u0617087
     55 jsyu
     48 u0817018
     39 d0887201
     38 u0817037
     36 u0717032
     36 u0717021
     35 g309351018
     24 u0717075
     18 u0717022
      6 u0717033
      6 g310352017
      4 g310351020
      3 ta003
      2 u0817109
      2 tachem

Use of `date` : See also http://wild.life.nctu.edu.tw/class/common/unix/unix-date.txt.html.

date
Tue Nov 1 14:43:40 CST 2022     # Current date and time

date --date='Today'
Tue Nov 1 14:43:40 CST 2022     # Same as above; CST here is China Standard Time (UTC+8)

date --date='2 days ago'
Sun Oct 30 14:52:38 CST 2022     # Two days ago from current time and date

date --date='01/01/1970'
Thu Jan 1 00:00:00 CST 1970     # Pay attention to your local timezone

TZ=UTC date --date='01/01/1970'
date -u --date='01/01/1970'
Thu Jan 1 00:00:00 UTC 1970     # Epoch Time, a.k.a. the birth moment of Unix, the zeroth second.

date --date='01/01/1970 UTC'
date --date='01/01/1970 GMT'
Thu Jan 1 08:00:00 CST 1970     # UTC and GMT give the same result: midnight UTC is 08:00 CST

date --date='01/01/1970 GMT+8'
Thu Jan 1 00:00:00 CST 1970     # GMT+8 is CST

date --date='01/01/1970 CST'
Thu Jan 1 00:00:00 CST 1970

date --date='10/31/2022'
Mon Oct 31 00:00:00 CST 2022

date +"%s"
1667285898     # current time in seconds since 00:00 Jan 01 1970 UTC; the value does not depend on the local timezone

TZ=UTC date +"%s"
1667285898     # seconds since 00:00 Jan 01 1970 in UTC(GMT)

date --date='Today' +"%s"
1667285898

date --date='10/31/2022' +"%s"
1667145600

date -d @1667145600
Mon Oct 31 00:00:00 CST 2022

TZ=UTC date -d @1667145600
Sun Oct 30 16:00:00 UTC 2022
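
A short sketch of why the two outputs above describe the same instant. It assumes GNU date (the -d option is a GNU extension) and uses the POSIX-style zone string TZ=CST-8, meaning "a zone called CST, 8 hours ahead of UTC", so that it does not depend on the machine's own timezone setting:

```shell
stamp=1667145600

# The same number of seconds, shown as wall-clock time in two zones.
utc_view=$(date -u -d "@$stamp" '+%Y-%m-%d %H:%M')
cst_view=$(TZ=CST-8 date -d "@$stamp" '+%Y-%m-%d %H:%M')

# Converting each wall-clock time back in its own zone returns the same stamp.
utc_back=$(date -u -d "$utc_view" +%s)
cst_back=$(TZ=CST-8 date -d "$cst_view" +%s)

echo "UTC: $utc_view  CST: $cst_view"
echo "$utc_back $cst_back"
```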

echo $( date --date='10/31/2022' +"%s" )
1667145600     # convert stdout to string

date -d @$(( $(date --date='00:01 10/31/2022' +"%s") + 60*60*24*7*12 ))
Mon Jan 23 00:01:00 CST 2023     # 12 weeks later from Oct 31 2022 for the next COVID-19 vaccination

date +%Y-%m-%d_%H:%M:%S
echo $( date +%Y-%m-%d_%H:%M:%S )
2022-11-01_18:16:24     # Can be used as (partial) name of the temporary files
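
Because %s gives plain integers, date arithmetic reduces to shell arithmetic. The sketch below assumes GNU date and uses -u so the result does not depend on the local timezone; split_duration is a hypothetical helper name and 93784 is an arbitrary example value:

```shell
# Add 12 weeks to a starting time, as in the vaccination example above.
start=$(date -u -d '2022-10-31 00:01' +%s)
later=$(date -u -d "@$(( start + 60*60*24*7*12 ))" '+%Y-%m-%d %H:%M')

# Split a duration given in seconds into days, hours and minutes.
split_duration() {
  secs=$1
  printf '%d Days %d Hours %d Minutes\n' \
    $(( secs / 86400 )) $(( secs % 86400 / 3600 )) $(( secs % 3600 / 60 ))
}
example=$(split_duration 93784)

echo "$later"
echo "$example"
```

The difference between two %s values can be passed to split_duration in the same way.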

Exercise 1: Write a shell script that computes the total login time of each user on Ukko from Sep. 1 to Oct. 19, 2021,
based on the file 202109-10.log.
The file was generated with the command last -w, whose output ends with the line
wtmp begins Sun Jan 3 00:56:58 2021
I have already extracted the entries between Sep. 1 and Oct. 19 for you.

You can download it to your own computer using the wget command.
wget http://wild.life.nctu.edu.tw/class/1111430027/202109-10.log
# ↑↑↑ if the webpage is not locked with username and password

wget --http-user=430027 --http-password=ukko http://wild.life.nctu.edu.tw/class/1111430027/202109-10.log
# ↑↑↑ supply username and password for the locked webpage

For reference, the answer is:


jsyu       28 Days   2 Hours  54 Minutes
g309351018  7 Days  13 Hours   2 Minutes
d0887201    5 Days   2 Hours   2 Minutes
m309351015  4 Days  20 Hours  35 Minutes
ta003       4 Days  10 Hours  58 Minutes
u0817037    3 Days   8 Hours   8 Minutes
u0817001    2 Days  18 Hours  19 Minutes
u0817018    2 Days  14 Hours  12 Minutes
u0717032    2 Days   4 Hours  58 Minutes
u0717075    2 Days   2 Hours  19 Minutes
d410351001  1 Days  20 Hours  33 Minutes
u0717021    1 Days  18 Hours  19 Minutes
u0617087    1 Days  13 Hours  33 Minutes
u0717022    1 Days   8 Hours  58 Minutes
g310352017  0 Days   3 Hours  18 Minutes
u0717033    0 Days   3 Hours  14 Minutes
u0817109    0 Days   3 Hours  13 Minutes
g310351020  0 Days   3 Hours   1 Minutes
tachem      0 Days   0 Hours   3 Minutes

Exercise 2: Sort the table of atoms according to atom names, symbols, atomic mass and atomic numbers, respectively.
Data source: http://www.csudh.edu/oliver/chemdata/atmass.htm

Sample loop:

for i in `cat /etc/passwd | cut -d : -f 1` 
   do echo $i ;
   finger $i ;
done
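
The for loop above relies on word splitting of the command output, which is fine for login names but breaks on fields that contain spaces. A common alternative, sketched here on two made-up passwd-style lines (list_users is a hypothetical helper name), reads the input line by line and lets read split the fields on ":":

```shell
# Read passwd-style lines from stdin; "_" collects the fields we do not need.
list_users() {
  while IFS=: read -r user _ uid _; do
    printf '%s (uid %s)\n' "$user" "$uid"
  done
}

users=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
                      'jsyu:x:1000:100:Example User:/home/jsyu:/bin/tcsh' | list_users)
echo "$users"
```

On a real system the same function could be fed with list_users < /etc/passwd.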

head and tail:

Array in `bash`

Assign the content of "Lin Dong Tsai Hsieh Yu" to an array of five elements, then print the array contents. Array indices in bash start at 0.

LASTNAME="Lin Dong Tsai Hsieh Yu"      # string input
arr=($LASTNAME)                        # split the string into the array named arr
echo ${arr[0]} ${arr[3]}               # print the values of arr[0] and arr[3]
echo ${arr[*]}                         # print all the values in array arr
echo ${!arr[*]}                        # print the indices of array arr
echo ${#arr[*]}                        # print the number of elements in array arr
echo ${#arr[@]}                        # also print the number of elements in array arr
b=`seq 1 9`                            # b is a string
brr=($b)                               # from string variable $b to array ${brr[*]}
brr=(`seq 1 9`)                        # brr is an array
arr=(`echo "Lin Dong Tsai Hsieh Yu"`)  # assign array via stdout

The "@" sign can be used instead of the "*" in constructs such as ${arr[*]}; the result is the same except when the expansion appears inside double quotes. In that case the behavior matches that of "$*" and "$@": "${arr[*]}" returns all the items as a single word, whereas "${arr[@]}" returns each item as a separate word. For further explanations, see "Bash Arrays", Linux Journal, Jun 19 2008, written by Mitch Frazier.
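
The difference is easiest to see by counting the words that each form produces. This is a minimal bash sketch; count_args is a hypothetical helper, and one array element deliberately contains a space:

```shell
# bash only: arrays are not available in plain sh.
arr=(Lin Dong "Tsai Hsieh" Yu)        # four elements, one of them with a space

count_args() { echo $#; }             # prints how many arguments it received

star_words=$(count_args "${arr[*]}")      # quoted *: everything joined into one word
at_words=$(count_args "${arr[@]}")        # quoted @: one word per element
unquoted_words=$(count_args ${arr[@]})    # unquoted: split again on whitespace

echo "$star_words $at_words $unquoted_words"
```

For loops over array elements, the quoted "${arr[@]}" form is usually the one that keeps elements intact.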

Associative array in `bash`, a.k.a. hash table
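
As a minimal sketch of the idea (it requires bash 4 or newer; the array name logins is hypothetical and the counts are borrowed from the login table earlier on this page): an associative array is indexed by strings instead of integers.

```shell
# bash >= 4 only: declare -A creates an associative array.
declare -A logins
logins[jsyu]=55
logins[ta003]=3
logins[tachem]=2

# Update one entry by its string key.
logins[jsyu]=$(( ${logins[jsyu]} + 1 ))

n_keys=${#logins[@]}                  # number of stored keys

# The order of ${!logins[@]} is unspecified, so sort the keys for stable output.
sorted_keys=$(printf '%s\n' "${!logins[@]}" | LC_ALL=C sort | paste -sd' ' -)

echo "$n_keys keys: $sorted_keys; jsyu=${logins[jsyu]}"
```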

Arithmetic operations in `bash`

Arithmetic operations in bash are performed inside $(( .... )) and apply only to integers!

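echo $(( 7*8 ))                    # 56
a="13"
b="5"
c="8.9"
echo $(($a+$b))                    # 18
echo $(($a*$b))                    # 65
echo $(($a/$b))                    # 2 (integer division truncates)
echo $(($a* $(($b-$c))))           # bash: 5-8.9: syntax error in expression (error token is ".9")
echo "$a*($b-$c)" | bc -l          # -50.7 (bc handles floating-point numbers)
echo $(($a%$b))                    # 3 (remainder)
echo ${c/./,}                      # 8,9 (replace the first "." with ",")
echo -n $c                         # 8.9 without a trailing newline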
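
When bc is not available, awk can serve as the floating-point calculator instead. The sketch below reuses the values of a, b and c from above; using awk here is a suggested alternative, not the method shown in the lecture:

```shell
a=13; b=5; c=8.9

quotient=$(( a / b ))       # integer division: the fractional part is dropped
remainder=$(( a % b ))      # remainder of the integer division

# awk computes in floating point; -v passes the shell values in as awk variables.
ratio=$(awk -v a="$a" -v b="$b" 'BEGIN { printf "%.2f", a / b }')
product=$(awk -v a="$a" -v b="$b" -v c="$c" 'BEGIN { printf "%.1f", a * (b - c) }')

echo "$quotient $remainder $ratio $product"
```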

End of Week 07-08

Delete blank lines in the file.

sed -e '/^[[:space:]]*$/d' file     # delete empty lines and lines containing only spaces or tabs
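
A quick way to check such a pattern is to feed it a few known lines. In this sketch the input is made up: one empty line and one line holding only spaces and a tab, placed between two text lines. The POSIX class [[:space:]] is used so that tabs are matched regardless of the sed implementation:

```shell
# printf turns \n and \t into a real newline and a real tab.
input() { printf 'first\n\n \t \nsecond\n'; }

# Delete every line that is empty or contains only whitespace.
cleaned=$(input | sed -e '/^[[:space:]]*$/d')

# grep -v with the same pattern keeps exactly the same lines.
cleaned_grep=$(input | grep -v '^[[:space:]]*$')

echo "$cleaned"
```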

Exercise 1: Write a shell script to convert 1-letter codes to 3-letter codes for amino acids. Example website. You may find tables in the Wiki or use the file. Also calculate its molecular weight in units of kDa and g/mol.
Exercise 2: Write a shell script to convert 3-letter codes to 1-letter codes for amino acids.
Exercise 3: Calculate the GC content of an input DNA sequence.

Internet references:

ERE

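bash script, reference material in Chinese

Regular Expression (Regex) introduction in Chinese, by Prof. 洪朝貴, Department of Information Management, Chaoyang University of Technology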

awk manual, GNU awk user guide

Convert PDB file to other formats:

Protein Data Bank
PDB file format

Crystal structure of chitosanase from Bacillus circulans MH-K1 at 1.6 Å resolution and its substrate recognition mechanism.
Original PDB file 1QGI.pdb before conversion.
After conversion: 1QGI.gjf in gaussian format.
and: 1QGI_3.gjf in gaussian ONIOM 3-layer format.

grep "^ATOM " 1QGI.pdb | awk -F " " '{ print  $NF$2, $7, $8, $9}'


grep "^ATOM " 1QGI.pdb | awk -F " " '{ printf "%s, %-2.4f, %-2.4f, %-2.4f \n", $NF$2, $7, $8, $9}'


grep "^ATOM " 1QGI.pdb | awk -F " " '{ print  $NF$2, $7, $8, $9}'|sed 's/ /\(Fragment\=1\)\ /'
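
The commands above split each ATOM line on whitespace, which works as long as neighbouring columns never touch. PDB ATOM records are fixed-width, so a more defensive variant reads by column position: serial number in columns 7-11, x, y, z in columns 31-38, 39-46, 47-54, and element symbol in columns 77-78. The sketch below builds one record with made-up coordinates so that it runs without 1QGI.pdb:

```shell
# One ATOM record in the standard PDB column layout (coordinates are made up).
atom_line=$(LC_ALL=C printf '%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f          %2s%2s' \
  ATOM 1 ' N  ' '' MET A 1 '' 27.340 24.430 2.614 1.00 9.67 N '')

# Extract by column position instead of by whitespace-separated field number.
label_xyz=$(printf '%s\n' "$atom_line" | awk '
  /^ATOM  / {
    elem = substr($0, 77, 2); gsub(/ /, "", elem)   # element symbol, columns 77-78
    serial = substr($0, 7, 5) + 0                   # atom serial number, columns 7-11
    printf "%s%d %.3f %.3f %.3f\n", elem, serial, substr($0, 31, 8), substr($0, 39, 8), substr($0, 47, 8)
  }')

echo "$label_xyz"
```

Against the real file, the same awk program could be applied with grep "^ATOM " 1QGI.pdb in place of the printf line.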



grep "^FORMUL " 1QGI.pdb


grep "^FORMUL " 1QGI.pdb | wc


grep "^FORMUL " 1QGI.pdb | wc -l


grep "^FORMUL " 1QGI.pdb | awk -F " " '{print $3}'


grep "^FORMUL " 1QGI.pdb | awk -F " " '{printf $3}'


grep "^FORMUL " 1QGI.pdb | head -1 | awk -F " " '{printf $3}'


grep "^FORMUL " 1QGI.pdb | head -1 | tail -1 | awk -F " " '{printf $3}'


grep "^FORMUL " 1QGI.pdb | head -2 | tail -1 | awk -F " " '{printf $3}'


grep "^HETATM " 1QGI.pdb | grep GCS



grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -1 | tail -1 | awk -F " " '{printf $3}'`


grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -1 | tail -1 | awk -F " " '{printf $3}'` | awk -F " " '{ print  $NF$2, $7, $8, $9}'|sed 's/ /\(Fragment\=2\)\ /'

Set variable from commands:

NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`

Loop in bash

for i in `cat /etc/passwd | cut -d : -f 1` 
   do echo $i ;
   finger $i ;
done



NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`
HETGRPS=1
while (("$HETGRPS" <= $NUM_HETGRPS))
  do grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -$HETGRPS | tail -1 | awk -F " " '{printf $3}'` |  awk -F " " '{ print  $NF$2"(Fragment="2")", $7, $8, $9}'
     HETGRPS=$((HETGRPS+1))
done



NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`
HETGRPS=1
while (("$HETGRPS" <= $NUM_HETGRPS))
  do grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -$HETGRPS | tail -1 | awk -F " " '{printf $3}'` |  awk  -F " " '{ print  $NF$2"(Fragment="HETGRPS")", $7, $8, $9}' HETGRPS=$HETGRPS
     HETGRPS=$((HETGRPS+1))
done



NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`
HETGRPS=1
while (("$HETGRPS" <= $NUM_HETGRPS))
  do grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -$HETGRPS | tail -1 | awk -F " " '{printf $3}'` |  awk  -F " " '{ print  $NF$2"(Fragment="HETGRPS")", $7, $8, $9}' HETGRPS=$((HETGRPS+1))
     HETGRPS=$((HETGRPS+1))
done
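
The loops above hand the shell counter to awk as a trailing name=value argument. An equivalent and perhaps more visible way is the -v option, which sets the awk variable before the program starts. A minimal sketch with one made-up coordinate line (the variable name frag is hypothetical):

```shell
frag=1
result=""
while [ "$frag" -le 3 ]; do
  # -v copies the current shell value of frag into an awk variable of the same name.
  line=$(echo "C1 0.0 0.0 0.0" | awk -v frag="$frag" '{ print $1 "(Fragment=" frag ")", $2, $3, $4 }')
  result="$result$line;"
  frag=$(( frag + 1 ))
done

echo "$result"
```

Inside the single-quoted awk program, frag is an awk variable; the shell's $frag is not expanded there, which is why it has to be passed in explicitly.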

Gaussian header: .gjf

#  PM6
(blank line)
Title
(blank line)
0,1
< xyz format follows >
(blank line)

Gaussian header for ONIOM layer format: .gjf

#  ONIOM(MP2/6-31G:HF/6-31G:PM6)
(blank line)
Title
(blank line)
0,1
< xyz format follows with extra L, M or H flag >
(blank line)
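
As a rough sketch of how the plain PM6 header could be glued onto a coordinate list in a script: write_gjf is a hypothetical helper, and the three water-like coordinate lines are made up for illustration. It follows the layout shown above: route line, blank line, title, blank line, charge and multiplicity, coordinates, blank line.

```shell
# $1 is the title; coordinate lines ("Element x y z") are read from stdin.
write_gjf() {
  printf '#  PM6\n\n%s\n\n0,1\n' "$1"
  cat            # copy the coordinate lines unchanged
  printf '\n'    # the closing blank line
}

gjf=$(printf '%s\n' 'O 0.000 0.000 0.000' 'H 0.757 0.586 0.000' 'H -0.757 0.586 0.000' | write_gjf 'water test')

printf '%s\n' "$gjf"
```

In a real conversion the coordinate lines would come from the grep and awk pipelines above, for example ... | write_gjf 1QGI > 1QGI.gjf.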
