cluster: cutoff changed lower than demanded

Dear mothur team,
Dear fellow mothur users,

I tried to solve my problem by browsing through the forum and wiki and still do not really understand why my processing is behaving in the certain way it does. I know that this is a topic that has been documented in the FAQ, several forum threads and various discussion boards outside of the mothur webpage. Nevertheless I cannot fully grasp why my dataset is behaving this particular way.
I would really much appreciate your help.

This is the framework:

My dataset
• Partial 16S rRNA gene tags from benthic and pelagic origin, amplified with primer set 341F[barcoded] and 785R/805R
• PCR free TruSeq library prep
• Illumina PE library (2*300bp) that were quality trimmed and merged with the BBtrim/BBmerge at “only” 60bp overlap (= no entire overlap, since this aspect is mentioned in the mothur context several times)
• Input: 1449949 sequences from 31 samples

I followed the MiSeq SOP with following commands (see at the end of the entry for clarification). Input to distance matrix as following (meaning, that my matrix should not be considered as a sparse matrix, right?) for 31 samples.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1243 407 0 3 1
2.5%-tile: 3 1243 440 0 4 31967
25%-tile: 3 1243 441 0 4 319663
Median: 3 1243 460 0 5 639325
75%-tile: 3 1243 465 0 6 958987
97.5%-tile: 3 1243 466 0 7 1246683
Maximum: 3 1243 496 0 8 1278649
Mean: 2.99484 1243 454.032 0 5.17941

of unique seqs: 523505

total # of seqs: 1278649

After calculating my distance matrix with a cutoff at 0.10 (dist.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.10, processors=8) I ran the cluster command (column=Allsamples.unique.good.filter.unique.precluster.pick.dist, count=Allsamples.unique.good.filter.unique.precluster.pick.count_table, cutof=0.02, hard=t ). I ran it with different settings (hard=t, precision= 100 and 1000) always at a cutoff of 0.02 (=lower that distance matrix) but always got an input whose cutoff was changed to ~0.0075 – 0.0079. Somewhat this does not make any sense to me. Different to other raised demands/questions, I would like my final cutoff to be higher that the adjusted (meaning 0.02 instead of 0.00787841).

Strangely I cannot find the mothur log to validate that I actually set the cutoff to 0.10 and not 0.01 for the dist.seqs command. A wrongly set cutoff would be the only explanation for the low cutoff during clustering. At the moment I am calculating a new distance matrix to rule out that possibility. Since the calculation takes several days and I am pretty sure that, this won’t be the problem I wanted to ask you, whether you have encountered that mentioned issue before or whether you have an idea of what I am doing wrong.

PS: Just running the cluster command without any cutoff set did not help either. For some reasons it only spit out output for unique clusters.

Any help is greatly appreciated.

Thanks a lot!

Below my commands (quality trimming not done with mothur)
• unique.seqs(fasta=Allsamples.fasta)
• count.seqs(name=Allsamples.names, group=Allsamples.groups, processors=8)
• pcr.seqs(fasta=silva.bacteria.fasta,start=6380, end=25316, keepdots=F,processors=8)
• system(mv silva.bacteria.pcr.fasta silva.nr_v119_v3v4.align)
• summary.seqs(fasta=silva.nr_v119_v3v4.align)
• align.seqs(fasta=Allsamples.unique.fasta, reference=silva.nr_v119_v3v4.align,flip=t)
• summary.seqs(fasta=Allsamples.unique.align, count=Allsamples.count_table)
• screen.seqs(fasta=Allsamples.unique.align, count=Allsamples.count_table,summary=Allsamples.unique.summary,start=8, end=18936)
• summary.seqs(fasta=Allsamples.unique.good.align,count=Allsamples.good.count_table, processors=8)
• filter.seqs(fasta=Allsamples.unique.good.align, vertical=T)
• unique.seqs(fasta=Allsamples.unique.good.filter.fasta, count=Allsamples.good.count_table)
• summary.seqs(fasta=Allsamples.unique.good.filter.unique.fasta, count=Allsamples.unique.good.filter.count_table)
• pre.cluster(fasta=Allsamples.unique.good.filter.unique.fasta, count=Allsamples.unique.good.filter.count_table,diffs=2)
• summary.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.fasta, count=Allsamples.unique.good.filter.unique.precluster.count_table)
• classify.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.fasta,count=Allsamples.unique.good.filter.unique.precluster.count_table,reference=silva.nr_v119_v3v4.align,, cutoff=60, processors=8)
• remove.lineage(fasta=Allsamples.unique.good.filter.unique.precluster.fasta, count=Allsamples.unique.good.filter.unique.precluster.count_table,, taxon=Chloroplast)
• summary.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.pick.fasta,count=Allsamples.unique.good.filter.unique.precluster.pick.count_table)
• count.groups(count=Allsamples.unique.good.filter.unique.precluster.pick.count_table)
• dist.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.10, processors=8)
• cluster(column=Allsamples.unique.good.filter.unique.precluster.pick.dist, count=Allsamples.unique.good.filter.unique.precluster.pick.count_table, cutof=0.02, hard=t)
o changed cutoff to 0.00787841

Are you wanting 0.02 OTUs? If so you should be running the clustering with higher cutoff (I’d guess at least double desired OTU level but am just guessing that rule of thumb), then when you run make.shared(label=0.02) to get just the 0.02 OTUs

Thanks for the answer.
I will give it a try. Since I tried to run the command without any cutoff, I doubt, that it will actually give me a cutoff that is not changed automatically below 0.02. Or am I getting it wrong, that a changed cutoff during clustering will not allow me to get any information about OTUs at a higher cutoff level (i.e. I want OTUs at 0.02, cutoff was changed to <0.02).

You are also setting the cutoff in dist.seqs (cutoff=0.10). Try setting that and the cutoff in cluster to 0.20.

Also see this…

Hi Pat,
thanks for the answer.
I have redone everything numerous times now. I never calculated the distance matrix at such “high” cutoff since I wasnt really interested in a cutoff that high. I will give it a try now.

You might also try just using cluster.split, which should be faster and just as accurate.

The combination of calculating a new distance matrix with cutoff=0.20 and cluster.split (including clustering at 0.20) worked for me!
Thanks for the help