sed and awk for genomics

In my continuing quest to conquer fastQ, fastA, SAM & BAM files, I have accumulated several useful ‘tools’. Many of them are included in other software packages (e.g. SAMtools), but for some tasks, especially file management and conversion, no standard toolkit exists, and researchers end up scripting their own solutions.

For me, sed and awk, along with a few other standard *nix tools, have been extremely useful, and I think a few of these are generally useful. In hopes of helping others, I’m going to start a list here. The initial list is short (these were the ones I used in the past week or so, fresh in my mind), but I plan to add to and update it as I discover new, better ways of handling these files. Most of these were inspired by stuff I’ve seen on the internet, so if you have something you want to share, please add it in the comments!

#Create an unwrapped fasta file. Add ambiguous nucleotide codes to the character class if your sequences contain them; also, a fasta defline ending in any of ‘ACTGNn-’ will trip the script up, so use caution…

sed -i ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' test.fa
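
#If you can’t rule out deflines ending in those characters, an awk version keyed on the ‘>’ itself sidesteps the problem entirely; a sketch (the output file name is just an example):

awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq=seq $0} END {if (seq) print seq}' test.fa > unwrapped.fa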

#Add /1 to the end of each fastQ defline, good for assemblers that need it to tell left and right files apart. Note that you should change [0-9] to whatever follows your @; this is machine-specific.

sed -i 's_^@[0-9]:.*_&/1_g' left.fq
sed -i 's_^@[0-9]:.*_&/2_g' right.fq
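
#For example, the left-file command turns a (hypothetical) defline like ‘@7:1101:1225:2130 1:N:0:ATCACG’ into ‘@7:1101:1225:2130 1:N:0:ATCACG/1’.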

#Deinterleave a shuffled fastq file

sed -n '/2:N/{N;N;N;p;}' shuff.fastq > right.fq
sed -n '/1:N/{N;N;N;p;}' shuff.fastq > left.fq
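
#If your deflines don’t carry the 1:N/2:N tags, a position-based awk alternative works on any strictly interleaved file (4-line records, left read first); a sketch, with hypothetical file names:

awk '{if (int((NR-1)/4)%2==0) print > "left.fq"; else print > "right.fq"}' shuff.fastq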

#Get the sum of a column (column 2, $2, in this case)

awk '{add +=$2} END {print add}'
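
#As a usage example (aln.bam is a hypothetical alignment file), the same idiom can sum the lengths of all sequences in a BAM, since SEQ is column 10 of SAM:

samtools view aln.bam | awk '{sum+=length($10)} END {print sum}'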

#Get the SD of a column (column 2, $2, in this case); NR>2 skips the first two (header) lines, and the result is the population SD

awk 'NR>2 {n++; sum+=$2; array[n]=$2} END {mean=sum/n; for(x=1;x<=n;x++) sumsq+=(array[x]-mean)^2; print sqrt(sumsq/n)}'
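
#A one-pass alternative (Welford’s method) avoids storing every value, which helps on very large tables; a sketch, again for column 2, with no header lines skipped:

awk '{n++; delta=$2-mean; mean+=delta/n; M2+=delta*($2-mean)} END {print sqrt(M2/n)}'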

#Print the mean of a column (column 2, $2, in this case)

awk '{sum+=$2} END { print "Average = ",sum/NR}'
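
#A related usage example: the mean read length of a fastQ file, since in a 4-line fastQ record the sequence is every fourth line starting at line 2 (left.fq as above):

awk 'NR%4==2 {sum+=length($0); n++} END {print "Average read length =", sum/n}' left.fq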

#Remove duplicate entries in column 10 (awk keeps only the first line it sees for each column-10 value), and print the new spreadsheet to unique10.txt

sort -k10 test.txt | awk '!a[$10]++' > unique10.txt
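
#A variant of the same trick: to keep, for each column-10 value, the row with the largest value in (say) column 3, make that the secondary sort key; the output name is hypothetical:

sort -k10,10 -k3,3nr test.txt | awk '!a[$10]++' > best10.txt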

#Retain entries in a spreadsheet where column 3 is greater than 300 (e.g. culling by sequence length)

awk '$3 > 300' table.txt > morethan300.txt
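
#To build such a table in the first place (sequence name and length from an unwrapped fasta, as produced above), a sketch; note the length lands in column 2 here, so adjust the filter accordingly:

awk '/^>/ {name=substr($0,2); next} {print name "\t" length($0)}' unwrapped.fa > table.txt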

#Extract fastQ from BAM for a bunch of BAM files; requires Picard. (sorry, no sed or awk in this one)

for i in *.bam;
do F=$(basename $i .bam);
java -Xmx9g -jar ~/software/SamFormatConverter.jar MAX_RECORDS_IN_RAM=1000000 INPUT=$i OUTPUT=/dev/stdout | java -Xmx9g -jar ~/software/SamToFastq.jar MAX_RECORDS_IN_RAM=1000000 INPUT=/dev/stdin FASTQ=$F.1.fq SECOND_END_FASTQ=$F.2.fq;
done
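
#If your SAMtools build is recent enough to have the fastq subcommand, the same extraction can be done without Picard; a sketch for paired-end data (name-sorting first so mates stay together):

for i in *.bam;
do F=$(basename $i .bam);
samtools sort -n -o $F.qsort.bam $i;
samtools fastq -1 $F.1.fq -2 $F.2.fq $F.qsort.bam;
done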