trim.flows yields much fewer seqs than trim.seqs

roey.angel · October 21, 2011, 5:07pm

Hi,
I’e been trying to use the new trim.flows + shhh.flows commands to pre-process my 16S bacteria sequences instead of using trim.seqs.

The trouble is that trim.flows ends up yielding far less sequences than trim.seqs (and also shorter ones).
Here are the two summaries:

Using trim.seqs
trim.seqs(fasta=RTL2011.08.fasta, oligos=RTL2011.08.bac.oligos, qfile=RTL2011.08.qual, minlength=200, maxambig=2, maxhomop=8, flip=T, bdiffs=2, pdiffs=3, qwindowaverage=25, qwindowsize=50, processors=4)

Start End NBases Ambigs Polymer
Minimum: 1 200 200 0 3
2.5%-tile: 1 212 212 0 4
25%-tile: 1 282 282 1 4
Median: 1 384 384 1 5
75%-tile: 1 422 422 1 5
97.5%-tile: 1 448 448 1 5
Maximum: 1 497 497 2 8

of Seqs: 310663

Using trim.flows trim.flows(flow=RTL2011.08.flow, oligos=RTL2011.08.bac.oligos, maxhomop=8, bdiffs=2, pdiffs=3, processors=4) shhh.flows(file=RTL2011.08.flow.files, processors=4)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 262 262 0 3 1
2.5%-tile: 1 285 285 0 4 4605
25%-tile: 1 296 296 0 4 46049
Median: 1 302 302 0 4 92098
75%-tile: 1 309 309 0 5 138146
97.5%-tile: 1 322 322 0 5 179590
Maximum: 1 343 343 0 8 184194
Mean: 1 302.665 302.665 0 4.48658

of Seqs: 184194

How would you suggest should I relax the conditions of trim.flows to get more sequences?
by reducing minflows or reducing signal and noise threshold?

Thanks,
Roey

pschloss · October 25, 2011, 11:39am

This makes sense. In trim seqs you are using qwindowaverage of 25 and a window size of 50. This probably doesn’t actually trim many if any of your sequences since the sequencer requires the reads to have a quality average of 20 over the full sequence. You will probably get an error rate with this set up of ~0.6%. In contrast, by default, when you use trim.flows any flowgram shorter than 450 flows is discarded and any longer than 450 flows is trimmed to 450 flows. This makes all of your sequences about 250 bp long. This will drop your error rate to below 0.1%. When you then run pre.cluster, the error rate will drop even further.

Frankly, I wouldn’t suggest reducing the threshold since your error rates will go up. Having more reads at the expense of including crappy data is hard to advocate for. The data I outline above and on the wiki will be available when PLoS ONE publishes our latest manuscript where we analyze 90 mock communities. I’m happy to provide the MS to anyone that emails me. If anyone has suggestions on how to further optimize a pipeline, I can get you in touch with the raw data and they can have at it.

Pat

roey.angel · October 25, 2011, 1:28pm

Thanks for the reply!

Though if I have data from the 454 GS FLX titanium why would I want to trim the seqs so early?
I ran trim.flows(flow=RTL2011.08.flow, oligos=RTL2011.08.arc.oligos, maxhomop=8, bdiffs=2, pdiffs=3, minflows=350, maxflows=750, processors=2)
which gave me a nice output of:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 207 207 0 3 1
2.5%-tile: 1 240 240 0 4 920
25%-tile: 1 347 347 0 4 9196
Median: 1 427 427 0 5 18391
75%-tile: 1 453 453 0 5 27586
97.5%-tile: 1 478 478 0 6 35862
Maximum: 1 529 529 0 9 36781
Mean: 1 397.501 397.501 0 4.77654

of Seqs: 36781

My problem though is still that half of the sequences here are duplicated and indeed if I process it with this tiny perl script:

my $file = ‘~/mothur/Analysis/RTL2011.08/output/ARC/RTL2011.08.shhh.fasta’;
my %seen = ();
{
local @ARGV = ($file);
local $^I = ‘.bac’;
while(<>){
$seen{$}++;
next if $seen{$} > 1;
print;
}
}
print “finished processing file.”;

I get:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 200 200 0 3 1
2.5%-tile: 1 212 212 0 4 460
25%-tile: 1 319 319 0 4 4592
Median: 1 398 398 0 5 9184
75%-tile: 1 426 426 0 5 13776
97.5%-tile: 1 450 450 0 6 17908
Maximum: 1 1320 1320 0 7 18367
Mean: 1 370.154 370.154 0 4.6081

of Seqs: 18367

which is pretty much half

any ideas?

Thanks again
Roey

pschloss · November 8, 2011, 1:37pm

Though if I have data from the 454 GS FLX titanium why would I want to trim the seqs so early?

Because the last 200 bp of the data are detritus. The paper showing this will be out on 11/10 in PLoS ONE. BTW, I can give you another 200 bp beyond what 454 does for a small service fee, but I’d trust it about as much

Can you send us the RTL2011.08.flow and RTL2011.08.arc.oligos files and we can take a look at what’s going on for you?

Pat

Topic		Replies	Views
shhh.flows reduce the length of the reads Commands in mothur	0	3426	June 28, 2012
shhh.flows output is very small! What's wrong? Commands in mothur	5	3299	April 3, 2014
trim.seqs in version 1.24 producing empty fasta mothur bugs	6	6148	January 3, 2014
question about trim.seqs settings Commands in mothur	2	2361	November 2, 2012
trim.flows in version 1.32 producing empty flow.files mothur bugs	4	5591	October 8, 2013

trim.flows yields much fewer seqs than trim.seqs

of Seqs: 310663

of Seqs: 184194

of Seqs: 36781

of Seqs: 18367

Related topics