sed and awk for genomics

In my continuing quest to conquer fastQ, fastA, sam & bam files, I have accumulated several useful ‘tools’. Many of them are included in other software packages (e.g. SAMtools), but for some tasks, especially file management and conversion, no standard toolkit exists, and instead researchers script their own solutions.

For me, sed and awk, along with a few other standard *nix tools, have been extremely useful, and a few of the one-liners below are, I think, generally useful. In hopes of helping others, I’m going to start a list here. The initial list is short (these are the ones I used in the past week or so, so they’re fresh in my mind), but I plan to add to and update it as I discover new and better ways of handling these files. Most of these were inspired by things I’ve seen on the internet, so if you have something you want to share, please add it in the comments!

#Create an unwrapped fasta file. Add any ambiguous nucleotide codes to the character class if your sequences contain them; also, a defline ending in one of the characters in ‘ACTGNn-’ will trip the script up, so use caution…

sed -i ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' test.fa
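
#A less fragile alternative, offered here as an untested sketch: this awk version keys on the ‘>’ defline rather than on sequence characters, so ambiguity codes and unusual deflines won’t trip it up. It writes to a new file rather than editing in place (test_unwrapped.fa is just a placeholder name).

awk '/^>/{if(NR>1)print ""; print; next} {printf "%s",$0} END{print ""}' test.fa > test_unwrapped.fa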

#Add /1 (or /2) to the end of each fastQ defline; useful for assemblers that need these suffixes to tell left and right files apart. Note that you should change [0-9] to whatever follows the @ in your headers; this is machine specific.

sed -i 's_^@[0-9]:.*_&/1_g' left.fq
sed -i 's_^@[0-9]:.*_&/2_g' right.fq
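
#If the machine-specific prefix is a nuisance, an untested sketch that keys on line position instead: in a well-formed fastQ file the header is the first of every four lines, so no knowledge of the header format is needed (the *.tagged.fq output names are placeholders).

awk 'NR%4==1{print $0"/1"; next} {print}' left.fq > left.tagged.fq
awk 'NR%4==1{print $0"/2"; next} {print}' right.fq > right.tagged.fq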

#Deinterleave a shuffled fastq file

sed -n '/2:N/{N;N;N;p;}' shuff.fastq > right.fq
sed -n '/1:N/{N;N;N;p;}' shuff.fastq > left.fq
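
#The two lines above assume Casava 1.8+ style headers containing ‘1:N’/‘2:N’. A header-agnostic sketch, assuming only that the file is strictly interleaved (each read followed by its mate, four lines each):

awk '{ if ((NR-1)%8 < 4) { print > "left.fq" } else { print > "right.fq" } }' shuff.fastq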

#Get the sum of a column (column 2, $2, in this case)

awk '{add +=$2} END {print add}'

#Get the SD of a column (column 2, $2, in this case)

awk '{sum+=$2; array[NR]=$2} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2)} print sqrt(sumsq/NR)}'
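
#For very large files, a one-pass sketch that avoids holding the whole column in memory; note it uses the computational formula for variance, which can be numerically less stable than the two-pass version above, and like it reports the population (not sample) SD.

awk '{sum+=$2; sumsq+=$2*$2} END {print sqrt(sumsq/NR - (sum/NR)^2)}'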

#Print the mean of a column (column 2, $2, in this case)

awk '{sum+=$2} END { print "Average = ",sum/NR}'

#Remove rows with duplicate entries in column 10, and write the deduplicated table to unique10.txt

cat test.txt | sort -k10 | awk '!a[$10]++' > unique10.txt

#Retain rows of a spreadsheet where column 3 is 300 or greater (imagine culling by sequence length)

cat table.txt | awk '300>$3{next}1' > morethan300.txt

#Extract fastQ from BAM for a bunch of BAM files– requires Picard. (sorry, no sed or awk in this one)

for i in `ls *bam`;
do F=`basename $i .bam`;
java -Xmx9g -jar ~/software/SamFormatConverter.jar MAX_RECORDS_IN_RAM=1000000 INPUT=$i OUTPUT=/dev/stdout | java -Xmx9g -jar ~/software/SamToFastq.jar MAX_RECORDS_IN_RAM=1000000 INPUT=/dev/stdin FASTQ=$F.1.fq SECOND_END_FASTQ=$F.2.fq;
done
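
#If a reasonably recent SAMtools (roughly 1.3 or newer) is installed, something like this sketch may replace the Picard two-step: samtools collate groups mates together before samtools fastq splits them into paired files (flags as per the samtools documentation, but treat this as untested).

for i in *.bam;
do F=`basename $i .bam`;
samtools collate -u -O $i | samtools fastq -1 $F.1.fq -2 $F.2.fq -0 /dev/null -s /dev/null -;
done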

Improving transcriptome assembly through error correction of high-throughput sequence reads

I am writing this blog post in support of a paper that I have just submitted to arXiv: Improving transcriptome assembly through error correction of high-throughput sequence reads. My goal is not to talk about the nuts and bolts of the paper so much as it is to ramble about its motivation and the writing process.

First, a little bit about me, as this is my first paper with my postdoctoral advisor, Mike Eisen. In short, I am an evolutionary biologist by training, having done my PhD on the relationship between mating systems and immunogenes in wild rodents. My postdoc work focuses on adaptation to desert life in rodents: I work on Peromyscus rodents in the Southern California deserts, combining field work and genomics. My overarching goal is to operate across multiple domains (genomics, field biology, evolutionary biology) to better understand basic questions: the links between genotype and phenotype, adaptation, etc. OK, enough; on to the paper.

Abstract:

The study of functional genomics, particularly in non-model organisms, has been dramatically improved over the last few years by the use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure, the de novo construction of a reference transcriptome, must be completed as a prerequisite to further analyses. This reference is critically important, as all downstream steps, including estimating transcript abundance, depend on its accuracy. Though a substantial amount of research has been done on assembly, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on, and while they have been shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, we show, via use of a simulated dataset, that applying error correction to sequencing reads has significant positive effects on assembly accuracy, reducing assembly error by nearly 50%, and therefore should be applied to all datasets.

For the past couple of years, I have had an interest in better understanding the dynamics of de novo transcriptome assembly. My reasons were mostly selfish/practical: a large amount of my work depends on getting these assemblies ‘right’. It quickly became evident that much of the computational research is directed at assembly itself, and very little at the pre- and post-assembly processes. We know these things are important, but an understanding of their effects is often lacking.

How error correction of sequencing reads affects assembly accuracy is one of the specific questions I’ve been interested in for the past several months. The idea of simulating RNAseq reads, applying various error corrections, and then measuring their effects is logical, so much so that I was really surprised it had not been done before. So off I went.

I wrote this paper over the course of a couple of weeks. It is a short and simple paper, and was quite easy to write. Of note, about 75% of the paper was written on the playground in the UC Berkeley University Village, while (loosely) supervising my two youngest daughters. How is that for work-life balance!

The read data will be available on Figshare, and I owe thanks to those guys for lifting the upload limit; the read file is 2.6 GB with .bz2 compression, so not huge, but not small either. The winning (AllPathsLG-corrected) assembly is there as well.

This type of work is inspired, in a very real sense, by C. Titus Brown, who is quickly becoming the go-to guy for understanding the nuts and bolts of genome assembly (and who also got tenure based on his Klout score, HA!). His post and paper on the challenges of mRNAseq analysis are the type of stuff that I aspire to.

Anyway, I’d be really interested in hearing what you all think of the paper, so read, enjoy, comment, and get to error correcting those reads!

UPDATE 1: The paper has made it to Haldane’s Sieve: http://haldanessieve.org/2013/04/04/improving-transcriptome-assembly-through-error-correction-of-high-throughput-sequence-reads/ and http://haldanessieve.org/2013/04/05/our-paper-improving-transcriptome-assembly-through-error-correction-of-high-throughput-sequence-reads/

A question about structural variants: these seem to be, as always, poorly reconstructed from de novo assemblies. No worse with error-corrected reads, but no better. In fact, contigs that are ‘full of errors’ are almost always those with complicated structural variation.

Would real reads behave differently than simulated reads? This certainly could be the case (but I don’t think so). What I can say is that the first iteration of this experiment was done using human reads from the SRA. The reason I did not use that experiment in the end is that it was hard to tell error from polymorphism relative to the reference, but the patterns were the same: fewer differences relative to the reference in error-corrected reads than in raw reads. So, I do not think that the results I see are an artifact of the simulated nature of the reads.