
PeerJ Preprint server for ALL of Biology!

I just got done uploading a manuscript to the PeerJ Preprint server. It was awesome! This manuscript fell outside even the loosest definition of quantitative biology, and was therefore not appropriate for arXiv.

I am really keen to see biologists using preprint servers, and up until now, where to send non-quantitative manuscripts was a bit of a problem. Sure, researchers could (and sometimes do) host pre-review manuscripts on their own websites, but this is problematic for a few reasons. I won’t get into all the details here, but the issue of visibility (and having a DOI) is huge: how are people supposed to find interesting manuscripts? A centralized repository like arXiv or PeerJ is crucial!

Now that the PeerJ preprint server is up and running, it’s up to us to keep it going. I like arXiv for a number of reasons (it’s also more established), but the PeerJ Preprint server has a few major advantages: commenting (which may fill the role Haldane’s Sieve currently fills for arXiv papers), speed (articles are visible within a few hours), social media integration, and article-level metrics (at least basic ones).

So, the question is this: where do I send my NEXT preprint? arXiv, or PeerJ? I guess I’m leaning towards PeerJ, unless somebody has a good argument against it.

P.S. I guess there are 3 things that concern me:

  1. What happens to preprints if PeerJ folds– are they lost in the bowels of the internets?
  2. Will journals consider PeerJ preprints equivalent to those posted on arXiv?
  3. Does Google Scholar index PeerJ preprints? (I think the answer is Yes)

I hate it when GREP doesn’t work like I want it to!

OK, so I have been fighting with grep all afternoon– I’m about to kick it in the teeth! Somehow, there are more lines in my output file than in my subject file. This is driving me crazy because I can’t figure out why!

Here is the simple enough command:

cat query.txt | sort -k1 | awk '{print $1}' | grep -wf - subject.txt > out.txt

>head query.txt
comp10000_c0_seq1 0
comp10002_c0_seq1 0
comp10003_c0_seq1 0
comp10004_c0_seq1 0
comp10005_c0_seq1 0
comp10007_c0_seq1 0
comp1000_c0_seq1 0
comp10011_c0_seq1 0
comp10013_c0_seq1 0
comp10014_c0_seq1 0

>head subject.txt
comp10000_c0_seq1 comp1898_c0_seq2 100.00 5407 0 0 1 5407 1 5407 0.0 9985
comp10002_c0_seq1 comp8374_c0_seq1 100.00 754 0 0 1 754 1 754 0.0 1393
comp10003_c0_seq1 comp8423_c0_seq1 100.00 4387 0 0 1 4387 1 4387 0.0 8102
comp10004_c0_seq1 comp8084_c0_seq1 100.00 3036 0 0 1 3036 1 3036 0.0 5607
comp10005_c0_seq1 comp8387_c0_seq1 100.00 2122 0 0 1 2122 1 2122 0.0 3919
comp10007_c0_seq1 comp8168_c0_seq1 100.00 1141 0 0 1 1141 1 1141 0.0 2108
comp1000_c0_seq1 comp23962_c0_seq1 100.00 326 0 0 1 326 1 326 2e-172 603
comp10011_c0_seq1 comp2125_c0_seq1 100.00 333 0 0 1 333 718 386 3e-176 616
comp10013_c0_seq1 comp8442_c0_seq1 100.00 2745 0 0 1 2745 1 2745 0.0 5070
comp10014_c0_seq1 comp8362_c0_seq1 100.00 1335 0 0 1 1335 1 1335 0.0 2466

>head out.txt
comp10000_c0_seq1 comp1898_c0_seq2 100.00 5407 0 0 1 5407 1 5407 0.0 9985
comp10002_c0_seq1 comp8374_c0_seq1 100.00 754 0 0 1 754 1 754 0.0 1393
comp10003_c0_seq1 comp8423_c0_seq1 100.00 4387 0 0 1 4387 1 4387 0.0 8102
comp10004_c0_seq1 comp8084_c0_seq1 100.00 3036 0 0 1 3036 1 3036 0.0 5607
comp10005_c0_seq1 comp8387_c0_seq1 100.00 2122 0 0 1 2122 1 2122 0.0 3919
comp10007_c0_seq1 comp8168_c0_seq1 100.00 1141 0 0 1 1141 1 1141 0.0 2108
comp1000_c0_seq1 comp23962_c0_seq1 100.00 326 0 0 1 326 1 326 2e-172 603
comp10011_c0_seq1 comp2125_c0_seq1 100.00 333 0 0 1 333 718 386 3e-176 616
comp10013_c0_seq1 comp8442_c0_seq1 100.00 2745 0 0 1 2745 1 2745 0.0 5070
comp10014_c0_seq1 comp8362_c0_seq1 100.00 1335 0 0 1 1335 1 1335 0.0 2466

>wc -l query.txt subject.txt out.txt
22885 query.txt
23560 subject.txt
23560 out.txt

So in theory, query is a subset of subject, and there should be no more than 22885 hits in the out file. And since the -w option makes grep match only whole words, partial ID matches (comp1000 picking up comp10000, say) shouldn’t be inflating the count either.
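
For what it’s worth, the subset assumption itself is easy to sanity check. Here is a minimal check with comm (assuming bash, since it uses process substitution):

# Print query IDs that do NOT appear in column 1 of subject.txt;
# empty output confirms query really is a subset of subject.
comm -23 <(awk '{print $1}' query.txt | sort) <(awk '{print $1}' subject.txt | sort)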

Nevertheless, I scanned these files for duplicates, and found none…


cat query.txt | sort -k1 | awk '!a[$1]++' | wc -l
22885
cat subject.txt | sort -k1 | awk '!a[$1]++' | wc -l
23560
cat out.txt | sort -k1 | awk '!a[$1]++' | wc -l
23560

No duplicates…

So I’m stumped..
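
One thing worth ruling out (a guess on my part, not a confirmed diagnosis): grep -wf matches each pattern anywhere on the line, not just in the first column, and subject.txt carries sequence IDs in column 2 as well. A quick way to test that theory is to redo the lookup with awk, which can be told to compare field 1 only:

# Count subject lines whose FIRST column is in query.txt.
# awk compares only $1, so IDs in column 2 cannot match.
awk 'NR==FNR {ids[$1]; next} $1 in ids' query.txt subject.txt | wc -l

# If this count is lower than wc -l out.txt, the extra lines came
# from query IDs matching the second column of subject.txt.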

Improving transcriptome assembly through error correction of high-throughput sequence reads

I am writing this blog post in support of a paper that I have just submitted to arXiv: Improving transcriptome assembly through error correction of high-throughput sequence reads. My goal is not to talk about the nuts and bolts of the paper so much as it is to ramble about its motivation and the writing process.

First, a little bit about me, as this is my first paper with my postdoctoral advisor, Mike Eisen. In short, I am an evolutionary biologist by training, having done my PhD on the relationship between mating systems and immunogenes in wild rodents. My postdoc work focuses on adaptation to desert life in rodents– I work on Peromyscus rodents in the Southern California deserts, combining field work and genomics. My overarching goal is to operate across multiple domains– genomics, field biology, evolutionary biology– to better understand basic questions: the links between genotype and phenotype, adaptation, and so on. OK, enough– on to the paper.

Abstract:

The study of functional genomics– particularly in non-model organisms– has been dramatically improved over the last few years by the use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure– the de novo construction of a reference transcriptome– must be completed as a prerequisite to further analyses. An accurate reference is critically important, as all downstream steps, including the estimation of transcript abundance, depend on it. Though a substantial amount of research has been done on assembly itself, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on, and while they have been shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here we show, using a simulated dataset, that applying error correction to sequencing reads has significant positive effects on assembly accuracy– reducing assembly error by nearly 50%– and therefore should be applied to all datasets.

For the past couple of years, I have had an interest in better understanding the dynamics of de novo transcriptome assembly. My reasons were mostly selfish/practical– a large amount of my work depends on getting these assemblies ‘right’. It quickly became evident that much of the computational research is directed at assembly itself, and very little at the pre- and post-assembly processes. We know these steps are important, but an understanding of their effects is often lacking.

How error correction of sequencing reads affects assembly accuracy is one of the specific questions I have been thinking about for the past several months. The idea of simulating RNAseq reads, applying various error corrections, and then measuring their effects is so logical that I was genuinely surprised it had not been done before. So off I went.
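
To make the design concrete, here is a stripped-down sketch of one iteration of that loop. This is my reconstruction, not the paper’s actual pipeline: the filenames are hypothetical, wgsim (a DNA read simulator) stands in for whichever RNAseq read simulator you prefer, and the correction and assembly steps are placeholders, since command lines vary by tool:

# 1. Simulate reads from known transcripts (wgsim: -e per-base error rate,
#    -r 0 turns off simulated polymorphism so every mismatch is sequencing
#    error, -N read pairs, -1/-2 read lengths). Filenames are hypothetical.
wgsim -e 0.01 -r 0 -N 1000000 -1 100 -2 100 transcripts.fa r1.fq r2.fq

# 2. Error-correct r1.fq/r2.fq with the corrector under test
#    (AllPathsLG's module, etc.)-- exact command lines vary by tool.

# 3. Assemble raw and corrected reads separately, then align each
#    assembly back to the known transcripts (-noHead drops the PSL header).
blat transcripts.fa contigs.fa -noHead assembly.psl

# 4. Column 2 of a headerless PSL line is the mismatch count, so total
#    assembly error is just the sum over all alignments.
awk '{sum += $2} END {print sum, "mismatches"}' assembly.psl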

I wrote this paper over the course of a couple of weeks. It is a short and simple paper, and was quite easy to write. Of note, about 75% of it was written on the playground in the UC Berkeley University Village, while (loosely) supervising my two youngest daughters. How is that for work-life balance!

The read data will be available on Figshare, and I owe thanks to those guys for lifting the upload limit– the read file is 2.6GB with .bz2 compression, so not huge, but not small either. The winning (AllPathsLG-corrected) assembly is there as well.

This type of work is inspired, in a very real sense, by C. Titus Brown, who is quickly becoming the go-to guy for understanding the nuts and bolts of genome assembly (and who also got tenure based on his Klout score, HA!). His post and paper on The challenges of mRNAseq analysis are the type of work I aspire to.

Anyway, I’d be really interested in hearing what you all think of the paper, so read, enjoy, comment– and get to error correcting those reads!


UPDATE 1: The paper has made it to Haldane’s Sieve: http://haldanessieve.org/2013/04/04/improving-transcriptome-assembly-through-error-correction-of-high-throughput-sequence-reads/ and http://haldanessieve.org/2013/04/05/our-paper-improving-transcriptome-assembly-through-error-correction-of-high-throughput-sequence-reads/

Question about Structural Variants: These seem to be, as always, poorly reconstructed from de novo assemblies– no worse with error-corrected reads, but no better. In fact, contigs that are ‘full of errors’ are almost always those with complicated structural variation.

Question about simulated reads: Real reads might behave differently from simulated reads. This certainly could be the case (though I don’t think so). What I can say is that the first iteration of this experiment was done using human (Homo) reads from the SRA. The reason I did not use that experiment in the end is that it was hard to tell error from polymorphism relative to the reference– but the patterns were the same: fewer differences from the reference in error-corrected reads than in raw reads. So I do not think that the results I see are an artifact of the simulated reads.