Cluster cutoff issue

I know several people have posted on the forum regarding the cluster cutoff changing to a lower value. I am having a similar issue with a rather large dataset (325k+ raw reads) and I have tried to make sure I am addressing the issue as recommended from previous posts on the forum (i.e. the screen.seqs step, filtering with trump=.). My 16S Titanium dataset consists of a total of 43 barcoded samples from a cyanobacteria & heterotrophic bacteria chemostat experiment. I started out with four files, two sets for each of the two replicates - since we only had 22 unique barcodes to work with.

I followed the example Costello dataset to process my data using mothur v.1.19.0. After trim.seqs, unique.seqs, align.seqs and screen.seqs, I decided to merge the files:
Rep 1: 109,242 unique sequences (172,590 total)
Rep 2: 86,755 unique sequences (132,198 total)

I ran both replicates through chimera.slayer separately, removed chimeras from the files, filtered the sequences with the trump=. option, and then identified sequences that were duplicated after trimming leaving me with:
Rep 1: 62,531 unique sequences (157,126 total)
Rep 2: 52,129 unique sequences (120,823 total)

I ran pre.cluster which removed about 20k sequences from each:
Rep 1: 40,159 unique sequences (157,126 total)
Rep 2: 33,942 unique sequences (120,823 total)

I then classified the sequences using the RDP training set and began sequence analysis. This is where I began having issues. I had to run these files on a Linux operating system with 16 GB RAM because my MacBook Pro (4 GB RAM) could not process some of the 4-6 GB distance matrices. I ran a distance matrix for both replicates individually and set the cutoff value to 0.1. I then ran the cluster using the average neighbor algorithm on each column-formatted matrix with a cutoff set to 0.05. In both instances, it changed my cutoffs to 0.018518 and 0.02063609 for Rep 1 and Rep 2, respectively. The make.shared file reports numbers for unique and 0.01 for Rep 1 and numbers for unique, 0.01 and 0.02 for Rep 2. After classifying the OTUs:
Rep 1 - 7176 OTUs
Rep 2 - 667 OTUs

I would like to work with a 0.03 cutoff value and I wasn’t sure whether there was an issue with the code, how I processed my sequences, or if these cutoffs at 0.03 are not any different from those at 0.01 (for Rep 1) and 0.02 (for Rep 2) and that is all the cluster command will output for my dataset. I would be happy to send you any of the code files or output files if you think those would help resolve this issue.



This is a byproduct of how we implement the average neighbor algorithm to use smaller matrices. You might try to use a cutoff of 0.15 or 0.20. I’m also surprised that you have so many unique sequences for the number of total sequences - are you using the quality settings in trim.seqs? I just updated the Costello example to hopefully make the chimera checking faster and more sensitive.


Hi Pat,

Thanks for your prompt reply. In response to your question, for the trim.seqs, I used the qaverage=25 option. I have tried setting the distance matrix cutoff to 0.03, 0.1, 0.2, and 0.3 with my cluster cutoff at 0.03.

@ 0.03 - the cutoff is changed to 0.0122488 and 0.0126267, for Rep 1 & Rep 2
@ 0.1 - the cutoff is changed to 0.018518 and 0.0203609, for Rep 1 & Rep 2
@ 0.2 - the cutoff is changed to 0.0120992 and 0.0126267, for Rep 1 & Rep 2
@ 0.3 - the cutoff is changed to 0.012245 and 0.0126267, for Rep 1 & Rep 2
~So the make.shared and classify.otu commands always use 0.01

I also tried generating a distance matrix with 25% of the sequences (~7693 for Rep 2). I manually removed the remaining 75% from both the fasta and names files. After setting the distance matrix cutoff at 0.1 again, my cluster cutoff still changed to 0.0141268.

Any other ideas? I am considering running my dataset through the pipeline again soon using v.1.20.0 and possibly changing some of the trim/screen options. Thanks for your help.


Hi Claire,

I’m afraid qaverage=25 won’t do much for you. I’d strongly encourage you to do qwindowaverage=35, qwindowsize=50 in trim.seqs.


Thanks Pat. I will try running the dataset again through the pipeline on v.1.20.0 with the qwindowaverage/qwindowsize options set.

Could you explain why qaverage=25 will not work for this dataset?


So the data come off the sequencer with an average Q of 20 and an error rate ~0.6%. If you use an average of 30 the error rate drops to about 0.55%. If you use the qwindowaverage=35 and qwindowsize of 50, the error rate drops to ~0.1%. After you use pre.cluster on the data the error rate drops even further. The problem with the higher error rates is that it is basically generating spurious sequence reads. With the lower error rate, you will have fewer unique sequence reads. This makes a lot of the downstream processing a lot easier and more reliable.

Hope this helps…

Hi Pat -

Are you aware of a publication that uses this qwindowaverage approach to quality screening? I’d like to have some published precedent for handling the data this way…

Ummm… Yeah, it’s on my computer waiting to hear back from a journal on whether I can submit it to them :). Hopefully it will be submitted on Monday or Tuesday.