Dear mothur team,
Dear fellow mothur users,
I tried to solve my problem by browsing through the forum and wiki and still do not really understand why my processing is behaving in the certain way it does. I know that this is a topic that has been documented in the FAQ, several forum threads and various discussion boards outside of the mothur webpage. Nevertheless I cannot fully grasp why my dataset is behaving this particular way.
I would really much appreciate your help.
This is the framework:
My dataset
• Partial 16S rRNA gene tags from benthic and pelagic origin, amplified with primer set 341F[barcoded] and 785R/805R
• PCR free TruSeq library prep
• Illumina PE library (2*300bp) that were quality trimmed and merged with the BBtrim/BBmerge at “only” 60bp overlap (= no entire overlap, since this aspect is mentioned in the mothur context several times)
• Input: 1449949 sequences from 31 samples
I followed the MiSeq SOP with following commands (see at the end of the entry for clarification). Input to distance matrix as following (meaning, that my matrix should not be considered as a sparse matrix, right?) for 31 samples.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1243 407 0 3 1
2.5%-tile: 3 1243 440 0 4 31967
25%-tile: 3 1243 441 0 4 319663
Median: 3 1243 460 0 5 639325
75%-tile: 3 1243 465 0 6 958987
97.5%-tile: 3 1243 466 0 7 1246683
Maximum: 3 1243 496 0 8 1278649
Mean: 2.99484 1243 454.032 0 5.17941
of unique seqs: 523505
total # of seqs: 1278649
After calculating my distance matrix with a cutoff at 0.10 (dist.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.10, processors=8) I ran the cluster command (column=Allsamples.unique.good.filter.unique.precluster.pick.dist, count=Allsamples.unique.good.filter.unique.precluster.pick.count_table, cutof=0.02, hard=t ). I ran it with different settings (hard=t, precision= 100 and 1000) always at a cutoff of 0.02 (=lower that distance matrix) but always got an input whose cutoff was changed to ~0.0075 – 0.0079. Somewhat this does not make any sense to me. Different to other raised demands/questions, I would like my final cutoff to be higher that the adjusted (meaning 0.02 instead of 0.00787841).
Strangely I cannot find the mothur log to validate that I actually set the cutoff to 0.10 and not 0.01 for the dist.seqs command. A wrongly set cutoff would be the only explanation for the low cutoff during clustering. At the moment I am calculating a new distance matrix to rule out that possibility. Since the calculation takes several days and I am pretty sure that, this won’t be the problem I wanted to ask you, whether you have encountered that mentioned issue before or whether you have an idea of what I am doing wrong.
PS: Just running the cluster command without any cutoff set did not help either. For some reasons it only spit out output for unique clusters.
Any help is greatly appreciated.
Thanks a lot!
David
Below my commands (quality trimming not done with mothur)
• unique.seqs(fasta=Allsamples.fasta)
• count.seqs(name=Allsamples.names, group=Allsamples.groups, processors=8)
• pcr.seqs(fasta=silva.bacteria.fasta,start=6380, end=25316, keepdots=F,processors=8)
• system(mv silva.bacteria.pcr.fasta silva.nr_v119_v3v4.align)
• summary.seqs(fasta=silva.nr_v119_v3v4.align)
• align.seqs(fasta=Allsamples.unique.fasta, reference=silva.nr_v119_v3v4.align,flip=t)
• summary.seqs(fasta=Allsamples.unique.align, count=Allsamples.count_table)
• screen.seqs(fasta=Allsamples.unique.align, count=Allsamples.count_table,summary=Allsamples.unique.summary,start=8, end=18936)
• summary.seqs(fasta=Allsamples.unique.good.align,count=Allsamples.good.count_table, processors=8)
• filter.seqs(fasta=Allsamples.unique.good.align, vertical=T)
• unique.seqs(fasta=Allsamples.unique.good.filter.fasta, count=Allsamples.good.count_table)
• summary.seqs(fasta=Allsamples.unique.good.filter.unique.fasta, count=Allsamples.unique.good.filter.count_table)
• pre.cluster(fasta=Allsamples.unique.good.filter.unique.fasta, count=Allsamples.unique.good.filter.count_table,diffs=2)
• summary.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.fasta, count=Allsamples.unique.good.filter.unique.precluster.count_table)
• classify.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.fasta,count=Allsamples.unique.good.filter.unique.precluster.count_table,reference=silva.nr_v119_v3v4.align, taxonomy=silva.bacteria.silva.tax, cutoff=60, processors=8)
• remove.lineage(fasta=Allsamples.unique.good.filter.unique.precluster.fasta, count=Allsamples.unique.good.filter.unique.precluster.count_table, taxonomy=Allsamples.unique.good.filter.unique.precluster.silva.wang.taxonomy, taxon=Chloroplast)
• summary.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.pick.fasta,count=Allsamples.unique.good.filter.unique.precluster.pick.count_table)
• count.groups(count=Allsamples.unique.good.filter.unique.precluster.pick.count_table)
• dist.seqs(fasta=Allsamples.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.10, processors=8)
• cluster(column=Allsamples.unique.good.filter.unique.precluster.pick.dist, count=Allsamples.unique.good.filter.unique.precluster.pick.count_table, cutof=0.02, hard=t)
o changed cutoff to 0.00787841