pre.cluster with fasta and name file - a faster implementation?

Hi Pat and Sarah,

I was running the sogin example dataset through mothur to compare summary.seqs outputs after the pre.cluster step (diffs=2), with and without group/count information.
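
For context, the two runs were along these lines (a sketch from memory: the input file names are inferred from the output names below, and the count table is assumed to have been built beforehand with count.seqs from the name and group files):

mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, name=sogin.unique.filter.names, diffs=2, processors=4)
mothur > pre.cluster(fasta=sogin.unique.filter.unique.fasta, count=sogin.unique.filter.count_table, diffs=2, processors=4)

Here is what I found: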

mothur > summary.seqs(fasta=sogin.unique.filter.unique.precluster.fasta,name=sogin.unique.filter.unique.precluster.names,processors=4)

Using 4 processors.

         Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       107     40      0       2       1
2.5%-tile:      1671    1904    57      0       3       5558
25%-tile:       1678    1904    60      0       3       55573
Median:         1678    1904    61      0       3       111146
75%-tile:       1678    1904    64      0       4       166719
97.5%-tile:     1678    1908    71      0       5       216734
Maximum:        2652    2709    100     0       31      222291
Mean:   1662.05 1886.15 62.0029 0       3.42599
# of unique seqs:       13248
total # of seqs:        222291

Output File Names:
sogin.unique.filter.unique.precluster.summary

It took 0 secs to summarize 222291 sequences.


mothur > summary.seqs(fasta=sogin.unique.filter.unique.precluster.fasta,count=sogin.unique.filter.unique.precluster.count_table)

Using 4 processors.

         Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       107     40      0       2       1
2.5%-tile:      1671    1904    57      0       3       5558
25%-tile:       1678    1904    60      0       3       55573
Median:         1678    1904    61      0       3       111146
75%-tile:       1678    1904    64      0       4       166719
97.5%-tile:     1678    1908    71      0       5       216734
Maximum:        2652    2709    100     0       31      222291
Mean:   1662.05 1886.15 62.0133 0       3.42885
# of unique seqs:       17084
total # of seqs:        222291

Output File Names:
sogin.unique.filter.unique.precluster.summary

It took 1 secs to summarize 222291 sequences.

Running pre.cluster without any group information leaves 13,248 unique sequences, versus 17,084 when the group/count information is used, a reduction of nearly 4,000 uniques. I would expect this difference to translate into many more merged sequences in a real dataset with 100K or more uniques. Although the global pre.cluster approach is considerably more time-consuming, it is clearly more effective at reducing the number of unique sequences going into downstream analyses (e.g., calculating sequence distances, OTU clustering).
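
To make the downstream cost concrete: the number of uniques is what drives the size of the distance matrix, since dist.seqs compares every pair of unique sequences. With placeholder file names (final.*), the traditional route is something like:

mothur > dist.seqs(fasta=final.fasta, cutoff=0.03, processors=4)
mothur > cluster(column=final.dist, name=final.names)

so shaving ~4,000 uniques off ~17,000 cuts the number of pairwise comparisons, and the matrix that cluster has to hold, by roughly 40%.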

Likewise, I ran chimera.uchime on the stool example dataset posted on the mothur wiki, with and without group information:

mothur > chimera.uchime(fasta=stool.trim.unique.good.align,name=stool.trim.good.names,group=stool.good.groups,processors=4,dereplicate=T)
mothur > remove.seqs(fasta=current,accnos=current,name=current,group=current)
mothur > summary.seqs(fasta=current,name=current)
Using stool.trim.unique.good.pick.align as input file for the fasta parameter.
Using stool.trim.good.pick.names as input file for the name parameter.

Using 4 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       162     162     0       3       1
2.5%-tile:      1       184     184     0       4       661
25%-tile:       1       212     212     0       5       6602
Median:         1       224     224     0       5       13203
75%-tile:       1       231     231     0       5       19804
97.5%-tile:     1       241     241     0       6       25745
Maximum:        1       244     244     0       6       26405
Mean:   1       219.757 219.757 0       4.91668
# of unique seqs:       14692
total # of seqs:        26405

Output File Names:
stool.trim.unique.good.pick.summary

It took 1 secs to summarize 26405 sequences.

mothur > chimera.uchime(fasta=stool.trim.unique.good.align,name=stool.trim.good.names,processors=1)
mothur > remove.seqs(fasta=current,accnos=current,name=current,group=stool.good.groups)
mothur > summary.seqs(fasta=current,name=current)
Using stool.trim.unique.good.pick.align as input file for the fasta parameter.
Using stool.trim.good.pick.names as input file for the name parameter.

Using 1 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       162     162     0       3       1
2.5%-tile:      1       183     183     0       4       635
25%-tile:       1       212     212     0       5       6348
Median:         1       224     224     0       5       12696
75%-tile:       1       231     231     0       5       19044
97.5%-tile:     1       241     241     0       6       24757
Maximum:        1       244     244     0       6       25391
Mean:   1       219.631 219.631 0       4.91741
# of unique seqs:       13693
total # of seqs:        25391

Output File Names:
stool.trim.unique.good.pick.summary

It took 0 secs to summarize 25391 sequences.

I found that mothur flagged ~1,000 additional unique sequences as chimeric (13,693 vs. 14,692 uniques remaining) when the whole dataset was screened at once rather than on a per-sample basis using the group information.
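
In case it is useful for reproducing this with the count-table workflow, I believe the per-sample run would look something like the following (assuming the count table carries the group information and reusing the file names above; if I understand the dereplicate option correctly, chimera.uchime then writes the per-sample-adjusted count table itself, so remove.seqs only needs to touch the fasta):

mothur > count.seqs(name=stool.trim.good.names, group=stool.good.groups)
mothur > chimera.uchime(fasta=stool.trim.unique.good.align, count=stool.trim.good.count_table, dereplicate=t, processors=4)
mothur > remove.seqs(fasta=current, accnos=current)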

For many of us interested in analyzing large-scale microbiome datasets, ending up with a large number of unique sequences at the final stage of sequence processing is a major obstacle to constructing high-quality de novo OTUs with the traditional cluster command, which is probably still preferable to the cluster.split approach.
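
For completeness, the cluster.split workaround I am referring to splits the dataset by taxonomy before clustering, roughly like this (final.* again being placeholder names):

mothur > cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03, processors=4)

It parallelizes well, but because sequences can never be merged across taxonomic bins, the traditional single-matrix cluster still seems safer for OTU quality.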

Do you have any ideas about how these steps could be sped up?

Thank you as always!