Cluster weirdly takes 49 sec to run!

Hi,

I’m a Mothur beginner wondering if it is possible that running the “cluster” command on a simple (not particularly powerfull) computer takes only 49 sec (for 8462 unique seqs)? it is supposed to be a loooong step…

Here is what I did:

mothur > dist.seqs(fasta=majorque.pick.fasta, cutoff=0.15, processors=2)

Output File Name:
majorque.pick.dist

It took 224 to calculate the distances for 8462 sequences.

mothur > cluster(column=majorque.pick.dist, name=majorque.pick.names)

********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


changed cutoff to 0.0540724

Output File Names:
majorque.pick.an.sabund
majorque.pick.an.rabund
majorque.pick.an.list

It took 49 seconds to cluster

Here is what I did from the begining - following the SOP :
Trim.flows
Shhh.flows
Trim.seqs
Unique.seqs
Align.seqs
Screen.seq
Filter.seq
Unique.seq
Pre cluster
Uchime (tested with or without an additional align+filter after chimera removal)
Classify seq
Remove lineage (Mitochondria-Chloroplast-Archaea-Eukarya-unknown)
Dist.seqs
Cluster

Everywhere I did follow the SOP and use the parameters as recommended, except for the following:

  • I used 360-720 as parameters for trim.flows (if using 450 flows, more than 50% of my sequences ended in the scrap file)
  • I did not use “trump=.” when I first filtered the sequences because it shortened the mean length of the sequences from 420 to 257 bp *** by the way I did not find an explanation on the wiki/forum, so any explanation here is also more than welcome, note that the sequences seem to align well, and overlap seems ok***. Anyway, after chimera removal, I tested an aditional align+filter step, this time using “trump=.” and results look the same with cluster command taking 70 sec.

Is there something I’m doing wrong?

Hoping I’m not missing something obvious here,
Thanks in advance for your help,

Marina

Not sure if that’s too fast. I guess it seems reasonable.

As you outed yourself as a “beginner” I’ll offer some unsolicited advice…

Everywhere I did follow the SOP and use the parameters as recommended, except for the following:

  • I used 360-720 as parameters for trim.flows (if using 450 flows, more than 50% of my sequences ended in the scrap file)
  • I did not use “trump=.” when I first filtered the sequences because it shortened the mean length of the sequences from 420 to 257 bp *** by the way I did not find an explanation on the wiki/forum, so any explanation here is also more than welcome, note that the sequences seem to align well, and overlap seems ok***. Anyway, after chimera removal, I tested an aditional align+filter step, this time using “trump=.” and results look the same with cluster command taking 70 sec.

Is there something I’m doing wrong?

  1. If you look at our PLoS ONE pipeline paper you’ll see that there’s really no reason to do shhh.flows if you are doing 360-720 flows. The error rates are much higher when you do 360-720 vs 450. If you are losing a lot of sequences, that typically means that you have bad sequence data and really need to go back to your sequence provider. Losing sequences is not an excuse for higher error rates.

  2. Not using trump=. flies in the face of our recent ISMEJ paper that says that it is critical for your reads to overlap each other in teh same alignment space and not more. What you are doing will further artificially increase the number of OTUs you are observing.

In short, please read the PLoS ONE paper and ISMEJ papers.

Thanks a lot for your reply

1/ I did read your PlosOne paper, and also the Quince paper, several posts on the forum … I told my sequence provider the quality of the sequence looked bad , of course he did not agree and recommended me to use 360-720, based on my amplicon size (463 bp)…

2/ Regarding the second issue I’m facing, I did read your ISME paper but I’m not sure to understand what “sequences that do not overlap the same region of the gene” mean. I managed to open the align file with BioEdit but it was not complete, probably because I have not enough memory. Anyway, when looking at the (partial) align file, the sequence seem to align and overlap in the same region, some of them being of course longer and so aligned on a longer region…
Modifying this file made mothur crash when I tried to filter it (after manually removing some sequences)
Please can you tell me what can I do? Is there something I can improve while screening the alignment? See below what I actually did

mothur > align.seqs(fasta=HZ62PXS04.shhh.trim.unique.fasta, reference=silva.bacteria.fasta, processors=2)
Output File Names:
HZ62PXS04.shhh.trim.unique.align
HZ62PXS04.shhh.trim.unique.align.report

mothur > summary.seqs(fasta=HZ62PXS04.shhh.trim.unique.align, name=HZ62PXS04.shhh.trim.unique.names)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 5337 14957 197 0 3 1
2.5%-tile: 6428 15966 235 0 4 2071
25%-tile: 6428 23439 421 0 4 20705
Median: 6428 25432 424 0 5 41409
75%-tile: 6428 25432 430 0 5 62113
97.5%-tile: 6430 25513 452 0 6 80747
Maximum: 13144 26790 467 0 8 82817
Mean: 6428.73 24182.9 409.096 0 4.68513

of unique seqs: 18340

total # of seqs: 82817

mothur > screen.seqs(fasta=HZ62PXS04.shhh.trim.unique.align, name=HZ62PXS04.shhh.trim.unique.names, group=HZ62PXS04.shhh.groups, start=6428, optimize=end, criteria=95, processors=2)

Output File Names :
HZ62PXS04.shhh.trim.unique.good.align
HZ62PXS04.shhh.trim.unique.bad.accnos
HZ62PXS04.shhh.trim.unique.good.names
HZ62PXS04.shhh.good.groups

mothur > summary.seqs(fasta=HZ62PXS04.shhh.trim.unique.good.align, name=HZ62PXS04.shhh.trim.unique.good.names)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 6421 16298 247 0 3 1
2.5%-tile: 6428 21596 309 0 4 1915
25%-tile: 6428 23963 424 0 4 19142
Median: 6428 25432 424 0 5 38283
75%-tile: 6428 25432 430 0 5 57424
97.5%-tile: 6428 25513 452 0 6 74650
Maximum: 6428 26790 467 0 8 76564
Mean: 6428 24633.7 418.601 0 4.69113

of unique seqs: 13945

total # of seqs: 76564


Thanks

I did again a screen.seqs, to remove the sequences which alignment ended before the others and then did another filter.seqs (with trump=.)
I reduced the mean length from 415 to 377 bp and the number of sequences from 13945 to 8800 (!)
As far as I understood, it is either the number of sequence dramatically decreases, either the length of the sequences decreases…
Any advise/experience about what is the best compromise between reducing the number of seq or reducing the length of the seq?
Thanks!
Marina

Well, if you follow the SOP you’ll get reads that are pretty uniformly about 260 bp. If you do that you’ll be fine.