Week 06

List of Unix commands defined by IEEE Std 1003.1-2008.


Review: Linux commands taught up to this week:



Formatting output in bash:

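Find unique data records using uniq:

cat ~jsyu/202109-10.log | uniq         # of little use here: uniq only collapses adjacent duplicate lines
cat ~jsyu/202109-10.log | uniq -c      # prefix each line with its repeat count
cat ~jsyu/202109-10.log | uniq -u      # print only lines that are not repeated
awk -F" " '{print $1}' ~jsyu/202109-10.log | uniq -c      # counts runs of adjacent names only; the input is not sorted
Other frequently used options of uniq: -i (ignore case), -s N (skip the first N characters) and -f N (skip the first N fields)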
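
Because uniq compares only neighbouring lines, the same name can be counted several times when the input is not sorted. The following is a minimal sketch on made-up login names (the `sample` function and its data are illustrative, not taken from 202109-10.log):

```shell
# Made-up sample: login names, unsorted, with non-adjacent repeats.
sample() { printf '%s\n' jsyu ta003 jsyu jsyu ta003; }

# Without sort, uniq sees four runs: jsyu, ta003, jsyu (twice), ta003.
sample | uniq -c

# After sort, identical names are adjacent, so each name appears once
# with its total count (3 for jsyu, 2 for ta003).
sample | sort | uniq -c
```

This is why the later examples pipe through sort before uniq -c.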

Sorting data using sort :

Frequently used options: -t (field separator), -k (sort key) and -g (general numeric sort)
awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c

     39 d0887201
     62 d410351001
     35 g309351018
      4 g310351020
      6 g310352017
     55 jsyu
     62 m309351015
      3 ta003
      2 tachem
     59 u0617087
     36 u0717021
     18 u0717022
     36 u0717032
      6 u0717033
     24 u0717075
     67 u0817001
     48 u0817018
     38 u0817037
      2 u0817109
wc -l ~jsyu/202109-10.log
602

echo $(awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c | awk -F" " '{printf $1"+"}' | sed -e 's/+$//' ) | bc
or
bc <<< $(awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c | awk -F" " '{printf $1"+"}' | sed -e 's/+$//' )
602
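
As a cross-check, the counts can also be summed directly in awk, without building a "+" expression for bc. This is a sketch of an alternative, not the method used above; `sum_counts` is a hypothetical helper name and the three input lines are copied from the uniq -c output shown earlier:

```shell
# sum_counts: add up the first whitespace-separated field of every input line.
sum_counts() { awk '{ total += $1 } END { print total + 0 }'; }

# Three lines in the layout produced by uniq -c; 39 + 62 + 35 = 136.
printf '%s\n' '     39 d0887201' '     62 d410351001' '     35 g309351018' | sum_counts
```

On the full log, `awk '{print $1}' ~jsyu/202109-10.log | sort | uniq -c | sum_counts` should agree with wc -l.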

awk -F" " '{print $1}' ~jsyu/202109-10.log | sort | uniq -c | sort -t " " -k 1 -g -r    # -r activates reverse sort

     67 u0817001
     62 m309351015
     62 d410351001
     59 u0617087
     55 jsyu
     48 u0817018
     39 d0887201
     38 u0817037
     36 u0717032
     36 u0717021
     35 g309351018
     24 u0717075
     18 u0717022
      6 u0717033
      6 g310352017
      4 g310351020
      3 ta003
      2 u0817109
      2 tachem
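
The -t, -k and -g options are easier to see on data with an explicit separator. A small sketch, assuming made-up colon-separated records (`sample_counts` and `sort_by_count` are illustrative names):

```shell
# Made-up "name:count" records.
sample_counts() { printf '%s\n' 'jsyu:55' 'ta003:3' 'u0817001:67' 'tachem:2'; }

# -t :     use ":" as the field separator
# -k 2,2   sort on field 2 only (a bare "-k 2" would extend the key to the end of the line)
# -g       compare the key as a number (-n is the POSIX option for plain integers)
# -r       reverse the order, largest first
sort_by_count() { sort -t : -k 2,2 -g -r; }

sample_counts | sort_by_count
```

The largest count (u0817001:67) comes out first and the smallest (tachem:2) last.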

Use of date : See also http://wild.life.nctu.edu.tw/class/common/unix/unix-date.txt.html.

date
Tue Nov 1 14:43:40 CST 2022     # Current date and time

date --date='Today'
Tue Nov 1 14:43:40 CST 2022     # Same as above; CST here is the local UTC+8 zone (China Standard Time), not US Central Standard Time

date --date='2 days ago'
Sun Oct 30 14:52:38 CST 2022     # Two days ago from current time and date

date --date='01/01/1970'
Thu Jan 1 00:00:00 CST 1970     # Pay attention to your local timezone

TZ=UTC date --date='01/01/1970'
date -u --date='01/01/1970'
Thu Jan 1 00:00:00 UTC 1970     # Epoch Time, a.k.a. the birth moment of Unix, the zeroth second.

date --date='01/01/1970 UTC'
date --date='01/01/1970 GMT'
Thu Jan 1 08:00:00 CST 1970     # UTC vs. GMT: both give the same result here; midnight UTC is 08:00 in CST

date --date='01/01/1970 GMT+8'
Thu Jan 1 00:00:00 CST 1970     # GMT+8 is CST

date --date='01/01/1970 CST'
Thu Jan 1 00:00:00 CST 1970

date --date='10/31/2022'
Mon Oct 31 00:00:00 CST 2022

date +"%s"
1667285898     # current time in seconds since 00:00:00 Jan 01 1970 UTC

TZ=UTC date +"%s"
1667285898     # same value: epoch seconds are always counted from 00:00 Jan 01 1970 UTC (GMT), whatever the local timezone

date --date='Today' +"%s"
1667285898     

date --date='10/31/2022' +"%s"
1667145600

date -d @1667145600
Mon Oct 31 00:00:00 CST 2022

TZ=UTC date -d @1667145600
Sun Oct 30 16:00:00 UTC 2022

echo $( date --date='10/31/2022' +"%s" )
1667145600     # command substitution $( ... ) captures stdout as a string

date -d @$(( $(date --date='00:01 10/31/2022' +"%s") + 60*60*24*7*12 ))
Mon Jan 23 00:01:00 CST 2023     # 12 weeks after Oct 31 2022, e.g. the date of the next COVID-19 vaccination

date +%Y-%m-%d_%H:%M:%S
echo $( date +%Y-%m-%d_%H:%M:%S )
2022-11-01_18:16:24     # Can be used as (partial) name of the temporary files
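
Once two moments are expressed in epoch seconds, their difference is plain integer arithmetic. The sketch below splits a number of seconds into days, hours and minutes; `fmt_duration` is a hypothetical helper, and its output wording is only an assumption about what a report might look like:

```shell
# fmt_duration SECONDS: print a duration as whole days, hours and minutes.
# Leftover seconds are dropped, not rounded.
fmt_duration() {
  secs=$1
  days=$(( secs / 86400 ))              # 86400 seconds per day
  hours=$(( secs % 86400 / 3600 ))      # hours left after the whole days
  minutes=$(( secs % 3600 / 60 ))       # minutes left after the whole hours
  echo "$days Days $hours Hours $minutes Minutes"
}

fmt_duration $(( 60*60*24*7*12 ))       # 84 Days 0 Hours 0 Minutes
fmt_duration 93784                      # 1 Days 2 Hours 3 Minutes
```

The argument can be the difference of two `date --date='...' +"%s"` values, for example the start and end of a login session.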


Exercise 1: Write a shell script that totals each user's login time on Ukko from Sep. 1 to Oct. 19, 2021,
based on the file 202109-10.log.

The file was generated using the command last -w
....
wtmp begins Sun Jan 3 00:56:58 2021

but I have already extracted the entries between Sep. 1 and Oct. 19 for you.

You can download it to your own computer using the wget command.
wget http://wild.life.nctu.edu.tw/class/1111430027/202109-10.log
# ↑↑↑ if the webpage is not locked with username and password

wget --http-user=430027 --http-password=ukko http://wild.life.nctu.edu.tw/class/1111430027/202109-10.log
# ↑↑↑ supply username and password for the locked webpage

For reference, the answer is:

jsyu       28 Days   2 Hours  54 Minutes
g309351018  7 Days  13 Hours   2 Minutes
d0887201    5 Days   2 Hours   2 Minutes
m309351015  4 Days  20 Hours  35 Minutes
ta003       4 Days  10 Hours  58 Minutes
u0817037    3 Days   8 Hours   8 Minutes
u0817001    2 Days  18 Hours  19 Minutes
u0817018    2 Days  14 Hours  12 Minutes
u0717032    2 Days   4 Hours  58 Minutes
u0717075    2 Days   2 Hours  19 Minutes
d410351001  1 Days  20 Hours  33 Minutes
u0717021    1 Days  18 Hours  19 Minutes
u0617087    1 Days  13 Hours  33 Minutes
u0717022    1 Days   8 Hours  58 Minutes
g310352017  0 Days   3 Hours  18 Minutes
u0717033    0 Days   3 Hours  14 Minutes
u0817109    0 Days   3 Hours  13 Minutes
g310351020  0 Days   3 Hours   1 Minutes
tachem      0 Days   0 Hours   3 Minutes



Exercise 2: Sort the table of elements by name, by symbol, by atomic mass and by atomic number, in turn.
Data source: http://www.csudh.edu/oliver/chemdata/atmass.htm

Sample loop:
for i in `cat /etc/passwd | cut -d : -f 1` 
   do echo $i ;
   finger $i ;
done


head and tail:

Array in bash

Assign the content of "Lin Dong Tsai Hsieh Yu" to an array of five elements, then print the array contents. Array indices in bash start at 0.
LASTNAME="Lin Dong Tsai Hsieh Yu"      # string input
arr=($LASTNAME)                        # split the string into words and assign them to the array named arr
echo ${arr[0]} ${arr[3]}               # print the values of arr[0] and arr[3]
echo ${arr[*]}                         # print all the values in array arr
echo ${!arr[*]}                        # print the indices of array arr
echo ${#arr[*]}                        # print the number of elements in array arr
echo ${#arr[@]}                        # also prints the number of elements in array arr
b=`seq 1 9`                            # b is a string
brr=($b)                               # from string variable $b to array ${brr[*]}
brr=(`seq 1 9`)                        # brr is an array
arr=(`echo "Lin Dong Tsai Hsieh Yu"`)  # assign array via stdout
The "@" sign can be used instead of the "*" in constructs such as ${arr[*]}. The result is the same except when the array is expanded inside a quoted string. In that case the behavior matches that of "$*" and "$@" within quoted strings: "${arr[*]}" returns all the items as a single word, whereas "${arr[@]}" returns each item as a separate word. For further explanations, see "Bash Arrays", Linux Journal, Jun 19 2008, written by Mitch Frazier.
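
The difference between the two quoted forms shows up when the words are counted. A minimal sketch (requires bash; `count_args` is a hypothetical helper, and one element deliberately contains a space):

```shell
arr=(Lin Dong "Tsai Hsieh" Yu)     # four elements; the third contains a space

# count_args: print how many separate words it received.
count_args() { echo $#; }

count_args "${arr[*]}"     # 1: all items joined into a single word
count_args "${arr[@]}"     # 4: one word per element
count_args ${arr[@]}       # 5: unquoted, so "Tsai Hsieh" is split again

# "${arr[@]}" is therefore the usual form for looping over elements.
for name in "${arr[@]}"; do
  echo "[$name]"
done
```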

Associative array in bash, a.k.a. hash table
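
A minimal sketch of an associative array, assuming bash 4 or newer. The array name `three` and its three entries (one-letter to three-letter amino acid codes) are chosen only for illustration:

```shell
declare -A three          # -A declares an associative array (string keys)
three[A]=Ala
three[G]=Gly
three[C]=Cys

echo "${three[G]}"        # look up one key: Gly
echo "${#three[@]}"       # number of entries: 3

# "${!three[@]}" expands to the keys; their order is not guaranteed.
for key in "${!three[@]}"; do
  echo "$key -> ${three[$key]}"
done
```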


Arithmetic operations in bash

Arithmetic operations in bash are performed inside $(( .... )) and apply only to integers!
echo $(( 7*8 ))                    # 56
a="13"
b="5"
c="8.9"
echo $(($a+$b))                    # 18
echo $(($a*$b))                    # 65
echo $(($a/$b))                    # 2 (integer division truncates)
echo $(($a* $(($b-$c))))           # bash: 5-8.9: syntax error in expression (error token is ".9")
echo "$a*($b-$c)" | bc -l          # -50.7 (bc handles floating point)
echo $(($a%$b))                    # 3 (remainder)
echo ${c/./,}                      # 8,9 (replace the first "." with ",")
echo -n $c                         # prints 8.9 with no trailing newline
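
Where bc is not installed, awk can do the same floating-point calculation. This is a sketch of an alternative to the bc line, not a replacement for it; the variable name `product` is illustrative:

```shell
a=13; b=5; c=8.9

# Pass the shell variables into awk with -v and compute in a BEGIN block,
# so no input file is needed. %.1f prints one digit after the decimal point.
product=$(awk -v a="$a" -v b="$b" -v c="$c" 'BEGIN { printf "%.1f\n", a * (b - c) }')
echo "$product"      # -50.7
```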

End of Week 07-08


Delete blank lines in the file.

sed -e '/^[[:space:]]*$/d' file      # deletes empty lines and lines containing only spaces or tabs
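
A quick way to check a blank-line filter is to feed it a few known lines. The sketch below uses the POSIX character class [[:space:]], which matches spaces and tabs alike; `strip_blank` is a hypothetical wrapper name:

```shell
# strip_blank [FILE...]: drop lines that are empty or contain only whitespace.
strip_blank() { sed -e '/^[[:space:]]*$/d' "$@"; }

# Five input lines: text, empty, spaces only, tabs only, text.
printf 'first\n\n   \n\t\t\nlast\n' | strip_blank
```

Only the lines "first" and "last" should remain.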

Exercise 1: Write a shell script to convert 1-letter codes to 3-letter codes for amino acids. Example website. You may find tables in the Wiki or use the file. Also calculate its molecular weight in units of kDa and g/mol.
Exercise 2: Write a shell script to convert 3-letter codes to 1-letter codes for amino acids.
Exercise 3: Calculate the GC content of an input DNA sequence.

Internet references:


Convert PDB file to other formats:

Protein Data Bank
PDB file format

Crystal structure of chitosanase from Bacillus circulans MH-K1 at 1.6 Å resolution and its substrate recognition mechanism.
Original PDB file 1QGI.pdb before conversion.
After conversion: 1QGI.gjf in gaussian format.
and: 1QGI_3.gjf in gaussian ONIOM 3-layer format.
grep "^ATOM " 1QGI.pdb | awk -F " " '{ print  $NF$2, $7, $8, $9}'

grep "^ATOM " 1QGI.pdb | awk -F " " '{ printf "%s, %-2.4f, %-2.4f, %-2.4f\n", $NF$2, $7, $8, $9}'
grep "^ATOM " 1QGI.pdb | awk -F " " '{ print $NF$2, $7, $8, $9}'|sed 's/ /\(Fragment\=1\)\ /'
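
The ATOM pipeline can be tried without the real file. The sketch below runs it on two made-up ATOM records (not taken from 1QGI.pdb); `sample_pdb` and `atom_xyz` are hypothetical names. It relies on whitespace splitting, as the commands above do, which assumes that no two columns run together; PDB is a fixed-column format, so that assumption can fail for some files.

```shell
# Made-up sample in PDB ATOM layout (illustrative coordinates only).
sample_pdb() {
  cat <<'EOF'
HEADER    SAMPLE
ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N
ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C
END
EOF
}

# atom_xyz: keep ATOM records, then print the element symbol ($NF) joined with
# the atom serial number ($2), a fragment tag, and the x, y, z fields ($7-$9).
atom_xyz() {
  grep "^ATOM " | awk '{ printf "%s%s(Fragment=1) %s %s %s\n", $NF, $2, $7, $8, $9 }'
}

sample_pdb | atom_xyz
```

The first output line is `N1(Fragment=1) 11.104 6.134 -6.504`.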

grep "^FORMUL " 1QGI.pdb
grep "^FORMUL " 1QGI.pdb | wc
grep "^FORMUL " 1QGI.pdb | wc -l
grep "^FORMUL " 1QGI.pdb | awk -F " " '{print $3}'
grep "^FORMUL " 1QGI.pdb | awk -F " " '{printf $3}'
grep "^FORMUL " 1QGI.pdb | head -1 | awk -F " " '{printf $3}'
grep "^FORMUL " 1QGI.pdb | head -1 | tail -1 | awk -F " " '{printf $3}'
grep "^FORMUL " 1QGI.pdb | head -2 | tail -1 | awk -F " " '{printf $3}'
grep "^HETATM " 1QGI.pdb | grep GCS

grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -1 | tail -1 | awk -F " " '{printf $3}'`
grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -1 | tail -1 | awk -F " " '{printf $3}'` | awk -F " " '{ print $NF$2, $7, $8, $9}'|sed 's/ /\(Fragment\=2\)\ /'
Set variable from commands:
NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`

Loop in bash
for i in `cat /etc/passwd | cut -d : -f 1` 
   do echo $i ;
   finger $i ;
done


# Version 1: every hetero group is labelled Fragment=2
NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`
HETGRPS=1
while (("$HETGRPS" <= $NUM_HETGRPS))
do
   grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -$HETGRPS | tail -1 | awk -F " " '{printf $3}'` | awk -F " " '{ print $NF$2"(Fragment="2")", $7, $8, $9}'
   HETGRPS=$((HETGRPS+1))
done

# Version 2: the loop counter is passed to awk as the variable HETGRPS, so fragments are numbered 1, 2, ...
NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`
HETGRPS=1
while (("$HETGRPS" <= $NUM_HETGRPS))
do
   grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -$HETGRPS | tail -1 | awk -F " " '{printf $3}'` | awk -F " " '{ print $NF$2"(Fragment="HETGRPS")", $7, $8, $9}' HETGRPS=$HETGRPS
   HETGRPS=$((HETGRPS+1))
done

# Version 3: awk receives the counter plus one, so fragments are numbered 2, 3, ... (the ATOM records above use Fragment=1)
NUM_HETGRPS=`grep "^FORMUL " 1QGI.pdb | wc -l`
HETGRPS=1
while (("$HETGRPS" <= $NUM_HETGRPS))
do
   grep "^HETATM " 1QGI.pdb | grep `grep "^FORMUL " 1QGI.pdb | head -$HETGRPS | tail -1 | awk -F " " '{printf $3}'` | awk -F " " '{ print $NF$2"(Fragment="HETGRPS")", $7, $8, $9}' HETGRPS=$((HETGRPS+1))
   HETGRPS=$((HETGRPS+1))
done
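
The same idea can be written as a function that reads the hetero group names once and loops over them. This is a sketch under stated assumptions, not the course's script: the records are made up, `sample_het` and `het_fragments` are hypothetical names, and the residue name is matched exactly in field 4 instead of with grep anywhere on the line.

```shell
# Made-up HETATM and FORMUL records (not taken from 1QGI.pdb).
sample_het() {
  cat <<'EOF'
HETATM 2001  C1  GCS B   1      10.000  20.000  30.000  1.00  0.00           C
HETATM 2002  O   HOH C   1       1.000   2.000   3.000  1.00  0.00           O
FORMUL   2  GCS    C6 H13 N O5
FORMUL   3  HOH   *2(H2 O)
EOF
}

# het_fragments FILE: for each FORMUL entry, print the matching HETATM atoms,
# numbering the fragments from 2 (fragment 1 is assumed to be the ATOM records).
het_fragments() {
  file=$1
  frag=2
  grep "^FORMUL " "$file" | awk '{ print $3 }' | while read -r het; do
    grep "^HETATM " "$file" | awk -v het="$het" -v frag="$frag" \
      '$4 == het { printf "%s%s(Fragment=%d) %s %s %s\n", $NF, $2, frag, $7, $8, $9 }'
    frag=$(( frag + 1 ))
  done
}

tmp=$(mktemp)
sample_het > "$tmp"
het_fragments "$tmp"
rm -f "$tmp"
```

On the sample this prints one line for GCS tagged Fragment=2 and one for HOH tagged Fragment=3.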
Gaussian header: .gjf
#  PM6
(blank line)
Title
(blank line)
0,1
< xyz format follows >
(blank line)
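
The header layout above can be produced by a small function that wraps coordinates read from stdin. A minimal sketch, assuming the PM6 route line and the 0,1 line (charge 0, multiplicity 1) shown above; `write_gjf` is a hypothetical name and the water coordinates are illustrative only:

```shell
# write_gjf TITLE: read "Symbol x y z" lines from stdin and print them
# wrapped in the header layout: route line, blank, title, blank, 0,1,
# coordinates, closing blank line.
write_gjf() {
  printf '#  PM6\n\n%s\n\n0,1\n' "$1"
  cat
  printf '\n'
}

printf '%s\n' 'O 0.000 0.000 0.117' 'H 0.000 0.757 -0.469' 'H 0.000 -0.757 -0.469' |
  write_gjf 'water test'
```

Redirecting the output to a file, for example `... | write_gjf 'title' > test.gjf`, gives a file in the layout described above.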


Gaussian header for ONIOM layer format: .gjf
#  ONIOM(MP2/6-31G:HF/6-31G:PM6)
(blank line)
Title
(blank line)
0,1
< xyz format follows with extra L, M or H flag >
(blank line)