Clustering at 98% identity threshold level

Hi everybody,

I’ve been following the MiSeq SOP on the mothur wiki to process my 18S rRNA amplicon dataset (V4 region).
I would like to cluster my reads into OTUs at a 98% identity threshold instead of the default 97%.

My mothur logfile looks like this:

Linux version

Using ReadLine

Running 64Bit Version

mothur v.1.38.1
Last updated: 8/9/2016

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program
Batch Mode


mothur > make.contigs(file=nice15.files, processors=56)

Using 56 processors.

>>>>> Processing file pair NB_S_Eu565F-Eu981R-1-ICE-18S_R1.fastq - NB_S_Eu565F-Eu981R-1-ICE-18S_R2.fastq (files 1 of 9) <<<<<
Making contigs...
Done.

It took 17 secs to assemble 227506 reads.


>>>>> Processing file pair NB_M_Eu565F-Eu981R-2-ICE-18S_R1.fastq - NB_M_Eu565F-Eu981R-2-ICE-18S_R2.fastq (files 2 of 9) <<<<<
Making contigs...
Done.

It took 18 secs to assemble 259245 reads.


>>>>> Processing file pair NB_B_Eu565F-Eu981R-3-ICE-18S_R1.fastq - NB_B_Eu565F-Eu981R-3-ICE-18S_R2.fastq (files 3 of 9) <<<<<
Making contigs...
Done.

It took 6 secs to assemble 78115 reads.


>>>>> Processing file pair TR_S_Eu565F-Eu981R-4-ICE-18S_R1.fastq - TR_S_Eu565F-Eu981R-4-ICE-18S_R2.fastq (files 4 of 9) <<<<<
Making contigs...
Done.

It took 19 secs to assemble 257892 reads.


>>>>> Processing file pair TR_M_Eu565F-Eu981R-5-ICE-18S_R1.fastq - TR_M_Eu565F-Eu981R-5-ICE-18S_R2.fastq (files 5 of 9) <<<<<
Making contigs...
Done.

It took 14 secs to assemble 196070 reads.


>>>>> Processing file pair TR_B_Eu565F-Eu981R-6-ICE-18S_R1.fastq - TR_B_Eu565F-Eu981R-6-ICE-18S_R2.fastq (files 6 of 9) <<<<<
Making contigs...
Done.

It took 12 secs to assemble 156678 reads.


>>>>> Processing file pair YP_S_Eu565F-Eu981R-7-ICE-18S_R1.fastq - YP_S_Eu565F-Eu981R-7-ICE-18S_R2.fastq (files 7 of 9) <<<<<
Making contigs...
Done.

It took 27 secs to assemble 368290 reads.


>>>>> Processing file pair YP_M_Eu565F-Eu981R-8-ICE-18S_R1.fastq - YP_M_Eu565F-Eu981R-8-ICE-18S_R2.fastq (files 8 of 9) <<<<<
Making contigs...
Done.

It took 22 secs to assemble 301489 reads.


>>>>> Processing file pair YP_B_Eu565F-Eu981R-9-ICE-18S_R1.fastq - YP_B_Eu565F-Eu981R-9-ICE-18S_R2.fastq (files 9 of 9) <<<<<
Making contigs...
Done.

It took 16 secs to assemble 191285 reads.

It took 151 secs to process 2036570 sequences.

Group count: 
NB_B 78115
NB_M 259245
NB_S 227506
TR_B 156678
TR_M 196070
TR_S 257892
YP_B 191285
YP_M 301489
YP_S 368290

Total of all groups is 2036570

Output File Names: 
nice15.trim.contigs.fasta
nice15.trim.contigs.qual
nice15.contigs.report
nice15.scrap.contigs.fasta
nice15.scrap.contigs.qual
nice15.contigs.groups

[WARNING]: your sequence names contained ':'.  I changed them to '_' to avoid problems in your downstream analysis.

mothur > summary.seqs(fasta=nice15.trim.contigs.fasta) #sum the results 

Using 56 processors.

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 135 135 0 3 1
2.5%-tile: 1 371 371 0 5 50915
25%-tile: 1 378 378 0 5 509143
Median:  1 378 378 0 5 1018286
75%-tile: 1 380 380 1 6 1527428
97.5%-tile: 1 384 384 10 6 1985656
Maximum: 1 540 540 68 268 2036570
Mean: 1 378.726 378.726 1.18402 5.4156
# of Seqs: 2036570

Output File Names: 
nice15.trim.contigs.summary

It took 2 secs to summarize 2036570 sequences.

mothur > screen.seqs(fasta=nice15.trim.contigs.fasta, group=nice15.contigs.groups, summary=nice15.trim.contigs.summary, maxambig=0, minlength=365, maxlength=385)

Using 56 processors.

Output File Names: 
nice15.trim.contigs.good.summary
nice15.trim.contigs.good.fasta
nice15.trim.contigs.bad.accnos
nice15.contigs.good.groups


It took 14 secs to screen 2036570 sequences.

mothur > unique.seqs(fasta=nice15.trim.contigs.good.fasta)
1350747 333889

Output File Names: 
nice15.trim.contigs.good.names
nice15.trim.contigs.good.unique.fasta


mothur > count.seqs(name=nice15.trim.contigs.good.names, group=nice15.contigs.good.groups) #abundance of each unique sequence and their distribution in each sample

Using 56 processors.
It took 7 secs to create a table for 1350747 sequences.


Total number of sequences: 1350747

Output File Names: 
nice15.trim.contigs.good.count_table


mothur > summary.seqs(count=nice15.trim.contigs.good.count_table) #sum the results
Using nice15.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 56 processors.

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 365 365 0 4 1
2.5%-tile: 1 371 371 0 5 33769
25%-tile: 1 378 378 0 5 337687
Median:  1 378 378 0 5 675374
75%-tile: 1 380 380 0 6 1013061
97.5%-tile: 1 384 384 0 6 1316979
Maximum: 1 385 385 0 11 1350747
Mean: 1 378.587 378.587 0 5.40228
# of unique seqs: 333889
total # of seqs: 1350747

Output File Names: 
nice15.trim.contigs.good.unique.summary

It took 1 secs to summarize 1350747 sequences.

mothur > pcr.seqs(fasta=Chlamydomonas_reinhardtii.fasta, oligos=pcrTest.oligos)

Using 56 processors.

Output File Names: 
Chlamydomonas_reinhardtii.pcr.fasta


It took 0 secs to screen 1 sequences.

mothur > align.seqs(fasta=Chlamydomonas_reinhardtii.pcr.fasta, reference=silva_nr_v128.fasta)

Using 56 processors.

Reading in the silva_nr_v128.fasta template sequences... DONE.
It took 206 to read  190661 sequences.
Aligning sequences from Chlamydomonas_reinhardtii.pcr.fasta ...
It took 1 secs to align 1 sequences.


Output File Names: 
Chlamydomonas_reinhardtii.pcr.align
Chlamydomonas_reinhardtii.pcr.align.report


mothur > summary.seqs(fasta=Chlamydomonas_reinhardtii.pcr.align) #positions of the alignment consensus region 

Using 56 processors.

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 13876 22115 374 0 5 1
2.5%-tile: 13876 22115 374 0 5 1
25%-tile: 13876 22115 374 0 5 1
Median:  13876 22115 374 0 5 1
75%-tile: 13876 22115 374 0 5 1
97.5%-tile: 13876 22115 374 0 5 1
Maximum: 13876 22115 374 0 5 1
Mean: 13876 22115 374 0 5
# of Seqs: 1

Output File Names: 
Chlamydomonas_reinhardtii.pcr.summary

It took 0 secs to summarize 1 sequences.

mothur > pcr.seqs(fasta=silva_nr_v128.fasta, start=13876, end=22115, keepdots=F)

Using 56 processors.

Output File Names: 
silva_nr_v128.pcr.fasta


It took 16 secs to screen 190661 sequences.

mothur > system(mv silva_nr_v128.pcr.fasta silva.v4.fasta)


mothur > align.seqs(fasta=nice15.trim.contigs.good.unique.fasta, reference=silva.v4.fasta, flip=T)

Using 56 processors.

Reading in the silva.v4.fasta template sequences... DONE.
It took 35 to read  190661 sequences.
Aligning sequences from nice15.trim.contigs.good.unique.fasta ...
[WARNING]: Some of your sequences generated alignments that eliminated too many bases, a list is provided in nice15.trim.contigs.good.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 134 secs to align 333889 sequences.


Output File Names: 
nice15.trim.contigs.good.unique.align
nice15.trim.contigs.good.unique.align.report
nice15.trim.contigs.good.unique.flip.accnos


mothur > summary.seqs(fasta=nice15.trim.contigs.good.unique.align, count=nice15.trim.contigs.good.count_table) #sum the results

Using 56 processors.

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 2 5 3 0 1 1
2.5%-tile: 2 8239 370 0 5 33769
25%-tile: 2 8239 377 0 5 337687
Median:  2 8239 377 0 5 675374
75%-tile: 2 8239 379 0 6 1013061
97.5%-tile: 2 8239 383 0 6 1316979
Maximum: 8235 8239 384 0 11 1350747
Mean: 3.00171 8238.94 377.58 0 5.40221
# of unique seqs: 333889
total # of seqs: 1350747

Output File Names: 
nice15.trim.contigs.good.unique.summary

It took 6 secs to summarize 1350747 sequences.

mothur > screen.seqs(fasta=nice15.trim.contigs.good.unique.align, count=nice15.trim.contigs.good.count_table, summary=nice15.trim.contigs.good.unique.summary, start=2, end=8239, maxhomop=8)

Using 56 processors.

Output File Names: 
nice15.trim.contigs.good.unique.good.summary
nice15.trim.contigs.good.unique.good.align
nice15.trim.contigs.good.unique.bad.accnos
nice15.trim.contigs.good.good.count_table


It took 12 secs to screen 333889 sequences.

mothur > filter.seqs(fasta=nice15.trim.contigs.good.unique.good.align, vertical=T, trump=.)

Using 56 processors.
Creating Filter... 


Running Filter... 



Length of filtered alignment: 962
Number of columns removed: 7277
Length of the original alignment: 8239
Number of sequences used to construct filter: 332930

Output File Names: 
nice15.filter
nice15.trim.contigs.good.unique.good.filter.fasta


mothur > unique.seqs(fasta=nice15.trim.contigs.good.unique.good.filter.fasta, count=nice15.trim.contigs.good.good.count_table)
332930 332306

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.count_table
nice15.trim.contigs.good.unique.good.filter.unique.fasta


mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group NB_M:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group NB_S:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group TR_B:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group TR_M:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group TR_S:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group YP_B:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group YP_M:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group NB_B:

mothur > pre.cluster(fasta=nice15.trim.contigs.good.unique.good.filter.unique.fasta, count=nice15.trim.contigs.good.unique.good.filter.count_table, diffs=3)

Using 56 processors.

Processing group YP_S:
17609 3222 14387
Total number of sequences before pre.cluster was 17609.
pre.cluster removed 14387 sequences.

It took 4 secs to cluster 17609 sequences.
28465 3739 24726
Total number of sequences before pre.cluster was 28465.
pre.cluster removed 24726 sequences.

It took 6 secs to cluster 28465 sequences.
30254 5712 24542
Total number of sequences before pre.cluster was 30254.
pre.cluster removed 24542 sequences.

It took 14 secs to cluster 30254 sequences.
51655 7149 44506
Total number of sequences before pre.cluster was 51655.
pre.cluster removed 44506 sequences.

It took 24 secs to cluster 51655 sequences.
51612 8210 43402
Total number of sequences before pre.cluster was 51612.
pre.cluster removed 43402 sequences.

It took 27 secs to cluster 51612 sequences.
53803 7447 46356
Total number of sequences before pre.cluster was 53803.
pre.cluster removed 46356 sequences.

It took 28 secs to cluster 53803 sequences.
40923 8038 32885
Total number of sequences before pre.cluster was 40923.
pre.cluster removed 32885 sequences.

It took 31 secs to cluster 40923 sequences.
44383 9797 34586
Total number of sequences before pre.cluster was 44383.
pre.cluster removed 34586 sequences.

It took 40 secs to cluster 44383 sequences.
64160 11104 53056
Total number of sequences before pre.cluster was 64160.
pre.cluster removed 53056 sequences.

It took 53 secs to cluster 64160 sequences.
It took 75 secs to run pre.cluster.

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.fasta
nice15.trim.contigs.good.unique.good.filter.unique.precluster.count_table
nice15.trim.contigs.good.unique.good.filter.unique.precluster.NB_B.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.NB_M.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.NB_S.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.TR_B.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.TR_M.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.TR_S.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.YP_B.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.YP_M.map
nice15.trim.contigs.good.unique.good.filter.unique.precluster.YP_S.map


mothur > chimera.uchime(fasta=nice15.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t) #identifying the chimeric seqs

Using 56 processors.

uchime by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.

Checking sequences from nice15.trim.contigs.good.unique.good.filter.unique.precluster.fasta ...
49203 here

It took 78 secs to check 3222 sequences from group NB_B.

It took 119 secs to check 3739 sequences from group TR_B.

It took 177 secs to check 5712 sequences from group YP_B.

It took 380 secs to check 7149 sequences from group TR_S.

It took 412 secs to check 7447 sequences from group NB_M.

It took 420 secs to check 8210 sequences from group YP_M.

It took 450 secs to check 8038 sequences from group TR_M.

It took 533 secs to check 9797 sequences from group NB_S.

It took 791 secs to check 11104 sequences from group YP_S.

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table
nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.chimeras
nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos


mothur > remove.seqs(fasta=nice15.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos) #excluding chimeric seqs
[WARNING]: This command can take a namefile and you did not provide one. The current namefile is nice15.trim.contigs.good.names which seems to match nice15.trim.contigs.good.unique.good.filter.unique.precluster.fasta.
Removed 17026 sequences from your fasta file.

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta

mothur > remove.lineage(fasta=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.silva_nr_v128.wang.taxonomy, taxon=Vertebrata-Annelida-Arthropoda-Cnidaria-Ctenophora-Echinodermata-Florideophycidae-Mollusca-Pav3-D226-FV18-2D11-Tunicata)

[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.


Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.silva_nr_v128.wang.pick.taxonomy
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta
nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table


mothur > dist.seqs(fasta=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.02, processors=56) 

Using 56 processors.

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist

It took 232 seconds to calculate the distances for 38641 sequences.

mothur > cluster(column=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, cutoff=0.02)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix:     |||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
changed cutoff to 0.00553361
It took 66 seconds to cluster

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list


mothur > make.shared(list=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.02)
unique

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.shared


mothur > classify.otu(list=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.silva_nr_v128.wang.taxonomy, label=0.02)
Your file does not include the label 0.02. I will use unique.
unique 38641

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.unique.cons.taxonomy
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.unique.cons.tax.summary

mothur > make.biom(shared=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.shared, constaxonomy=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.unique.cons.taxonomy) 
unique

Output File Names: 
nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.unique.biom


mothur > quit()

Although I used ‘cutoff=0.02’ with the dist.seqs command, the label seems to have been set to ‘unique’. Why does this happen?
I’m now wondering whether my sequences were actually clustered at the 98% identity threshold, or whether they just represent unique (100% identical) sequences.


Could the mothur team help me with this, please? Thanks in advance, @renh@

I think your problem is this line: “changed cutoff to 0.00553361”. The average neighbour clustering algorithm lowers the cutoff when the pairwise distances can’t support the one you requested. There are two solutions. The best is to download the newest release of mothur, because it uses a new clustering algorithm that doesn’t perform this distance averaging (so when you use the 0.02 cutoff, you get 98% OTUs).
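If you update, recent mothur releases (v1.39.0 onward, if I remember right) use the OptiClust algorithm by default in cluster(), which clusters at exactly the cutoff you request. A sketch reusing the filenames from your log (treat it as a sketch, not a tested recipe; the output list name may differ slightly by version):

```
# mothur v1.39+ batch sketch: OptiClust honours cutoff=0.02 directly,
# so the 0.02 label exists for make.shared. Filenames copied from the log above.
dist.seqs(fasta=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.02, processors=56)
cluster(column=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, cutoff=0.02)
make.shared(list=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.opti_mcc.list, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.02)
```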

If you can’t update your version of mothur, try dist.seqs again with a higher cutoff, such as 0.15. You should then be able to recluster at the 98% threshold.
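On your current v1.38, the workaround would look roughly like this (again a sketch, reusing the filenames from your log): compute the distance matrix well past the target cutoff so the averaging has headroom, then request the 0.02 label from make.shared.

```
# mothur v1.38 batch sketch with the average neighbour method. A 0.15 matrix
# gives the distance averaging enough room that the 0.02 label survives.
dist.seqs(fasta=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.15, processors=56)
cluster(column=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, method=average, cutoff=0.15)
make.shared(list=nice15.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=nice15.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.02)
```

Note the much larger distance matrix this produces; with 38k sequences it should still be manageable.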

Thank you dwaite.
I went with the first option and it seems to work fine.

Again, thank you,
@renh@