Hi Pat and Sarah,
I was running the sogin example dataset through mothur to compare summary.seqs outputs after the pre.cluster step (diffs=2), with and without a group/count file, and found:
mothur > summary.seqs(fasta=sogin.unique.filter.unique.precluster.fasta,name=sogin.unique.filter.unique.precluster.names,processors=4)
Using 4 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 107 40 0 2 1
2.5%-tile: 1671 1904 57 0 3 5558
25%-tile: 1678 1904 60 0 3 55573
Median: 1678 1904 61 0 3 111146
75%-tile: 1678 1904 64 0 4 166719
97.5%-tile: 1678 1908 71 0 5 216734
Maximum: 2652 2709 100 0 31 222291
Mean: 1662.05 1886.15 62.0029 0 3.42599
# of unique seqs: 13248
total # of seqs: 222291
Output File Names:
sogin.unique.filter.unique.precluster.summary
It took 0 secs to summarize 222291 sequences.
mothur > summary.seqs(fasta=sogin.unique.filter.unique.precluster.fasta,count=sogin.unique.filter.unique.precluster.count_table)
Using 4 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 107 40 0 2 1
2.5%-tile: 1671 1904 57 0 3 5558
25%-tile: 1678 1904 60 0 3 55573
Median: 1678 1904 61 0 3 111146
75%-tile: 1678 1904 64 0 4 166719
97.5%-tile: 1678 1908 71 0 5 216734
Maximum: 2652 2709 100 0 31 222291
Mean: 1662.05 1886.15 62.0133 0 3.42885
# of unique seqs: 17084
total # of seqs: 222291
Output File Names:
sogin.unique.filter.unique.precluster.summary
It took 1 secs to summarize 222291 sequences.
I can see there's a reduction of nearly 4,000 unique sequences (13,248 vs. 17,084) when pre.cluster is run without any group information. I would expect this difference to translate into many more unique sequences being merged in a real dataset with 100K or more unique sequences. Although this global pre.cluster approach is considerably more time-consuming, it is clearly more effective at reducing the number of unique sequences for downstream analyses (e.g., calculating sequence distances, OTU clustering).
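To illustrate why the global run merges more sequences, here is a toy sketch, not mothur's actual implementation: a rare error variant can only be absorbed by an abundant "parent" within diffs mismatches that is present in the same pool being preclustered, so per-group runs miss parents that live in other samples. All sequences and counts below are made up.

```python
# Toy model of pre.cluster's merging rule (hypothetical data, simplified logic):
# rare sequences within `diffs` mismatches of a more abundant sequence are
# absorbed into it, scanning from most to least abundant.

def hamming(a, b):
    """Number of mismatches between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def precluster(seqs, diffs=2):
    """seqs: (sequence, abundance) pairs. Returns the surviving uniques."""
    kept = []
    for seq, abund in sorted(seqs, key=lambda s: -s[1]):
        for i, (parent, pabund) in enumerate(kept):
            if hamming(seq, parent) <= diffs:
                kept[i] = (parent, pabund + abund)  # merge into the parent
                break
        else:
            kept.append((seq, abund))
    return kept

# The abundant parent of the error variant AAAATT occurs only in sample1.
sample1 = [("AAAAAA", 10)]
sample2 = [("AAAATT", 1), ("CCCCCC", 5)]

per_group = {s for s, _ in precluster(sample1) + precluster(sample2)}
global_run = {s for s, _ in precluster(sample1 + sample2)}

print(len(per_group))   # AAAATT survives in sample2: no parent there
print(len(global_run))  # AAAAAA absorbs AAAATT across samples
```

In the per-group runs AAAATT survives as a unique sequence, while the global run merges it away, which is the same direction as the difference in the two summaries above.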
Likewise, I ran chimera.uchime on the stool example dataset posted on the mothur wiki, with and without group information:
mothur > chimera.uchime(fasta=stool.trim.unique.good.align,name=stool.trim.good.names,group=stool.good.groups,processors=4,dereplicate=T)
mothur > remove.seqs(fasta=current,accnos=current,name=current,group=current)
mothur > summary.seqs(fasta=current,name=current)
Using stool.trim.unique.good.pick.align as input file for the fasta parameter.
Using stool.trim.good.pick.names as input file for the name parameter.
Using 4 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 162 162 0 3 1
2.5%-tile: 1 184 184 0 4 661
25%-tile: 1 212 212 0 5 6602
Median: 1 224 224 0 5 13203
75%-tile: 1 231 231 0 5 19804
97.5%-tile: 1 241 241 0 6 25745
Maximum: 1 244 244 0 6 26405
Mean: 1 219.757 219.757 0 4.91668
# of unique seqs: 14692
total # of seqs: 26405
Output File Names:
stool.trim.unique.good.pick.summary
It took 1 secs to summarize 26405 sequences.
mothur > chimera.uchime(fasta=stool.trim.unique.good.align,name=stool.trim.good.names,processors=1)
mothur > remove.seqs(fasta=current,accnos=current,name=current,group=stool.good.groups)
mothur > summary.seqs(fasta=current,name=current)
Using stool.trim.unique.good.pick.align as input file for the fasta parameter.
Using stool.trim.good.pick.names as input file for the name parameter.
Using 1 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 162 162 0 3 1
2.5%-tile: 1 183 183 0 4 635
25%-tile: 1 212 212 0 5 6348
Median: 1 224 224 0 5 12696
75%-tile: 1 231 231 0 5 19044
97.5%-tile: 1 241 241 0 6 24757
Maximum: 1 244 244 0 6 25391
Mean: 1 219.631 219.631 0 4.91741
# of unique seqs: 13693
total # of seqs: 25391
Output File Names:
stool.trim.unique.good.pick.summary
It took 0 secs to summarize 25391 sequences.
I found that mothur flagged ~1,000 additional unique sequences as chimeric when group information was not used, i.e., when chimeras were detected across the whole dataset rather than on a per-sample basis.
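For reference, the exact gap between the two runs, computed from the unique and total counts in the two summary.seqs outputs above:

```python
# Sequences remaining after remove.seqs, copied from the summaries above.
per_sample = {"uniques": 14692, "total": 26405}  # group= with dereplicate=T
global_run = {"uniques": 13693, "total": 25391}  # no group file

# Fewer sequences remain after the global run, so it flagged more chimeras.
extra_unique_chimeras = per_sample["uniques"] - global_run["uniques"]
extra_reads_removed = per_sample["total"] - global_run["total"]

print(extra_unique_chimeras)  # 999 more uniques flagged without groups
print(extra_reads_removed)    # 1014 more total reads removed without groups
```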
For many of us who are interested in analyzing large-scale microbiome datasets, ending up with a large number of unique sequences at the final stage of sequence processing is a major obstacle to constructing high-quality de novo OTUs with the traditional cluster command, which is probably still preferred over the cluster.split approach.
Is there anything we could do to speed up these steps?
Thank you as always!