I know several people have posted on the forum about the cluster cutoff changing to a lower value. I am having a similar issue with a rather large dataset (325k+ raw reads), and I have tried to address it as recommended in previous posts (i.e., the screen.seqs step and filtering with trump=.). My 16S Titanium dataset consists of 43 barcoded samples from a chemostat experiment with cyanobacteria and heterotrophic bacteria. Because we only had 22 unique barcodes to work with, I started out with four files: two sets for each of the two replicates.
I followed the Costello example dataset analysis to process my data in mothur v.1.19.0. After trim.seqs, unique.seqs, align.seqs, and screen.seqs, I merged the files for each replicate:
Rep 1: 109,242 unique sequences (172,590 total)
Rep 2: 86,755 unique sequences (132,198 total)
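For reference, my batch file up to this point looked roughly like the following (the filenames and parameter values here are placeholders, not my actual files):

```
trim.seqs(fasta=rep1a.fasta, oligos=rep1a.oligos, qfile=rep1a.qual, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2)
unique.seqs(fasta=rep1a.trim.fasta)
align.seqs(candidate=rep1a.trim.unique.fasta, template=silva.bacteria.fasta, flip=T)
screen.seqs(fasta=rep1a.trim.unique.align, name=rep1a.trim.names, group=rep1a.groups, optimize=start-end, criteria=85)
merge.files(input=rep1a.trim.unique.good.align-rep1b.trim.unique.good.align, output=rep1.good.align)
```

The merge.files step is how I combined the two barcode sets belonging to each replicate.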
I ran both replicates through chimera.slayer separately, removed the flagged chimeras, filtered the sequences with the trump=. option, and then re-ran unique.seqs to remove sequences that became duplicates after trimming, leaving me with:
Rep 1: 62,531 unique sequences (157,126 total)
Rep 2: 52,129 unique sequences (120,823 total)
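The chimera removal and filtering steps were, in outline (again with placeholder filenames):

```
chimera.slayer(fasta=rep1.good.align, template=silva.gold.align)
remove.seqs(accnos=rep1.good.slayer.accnos, fasta=rep1.good.align, name=rep1.good.names, group=rep1.good.groups)
filter.seqs(fasta=rep1.good.pick.align, vertical=T, trump=.)
unique.seqs(fasta=rep1.good.pick.filter.fasta, name=rep1.good.pick.names)
```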
I then ran pre.cluster, which removed roughly 20k unique sequences from each replicate:
Rep 1: 40,159 unique sequences (157,126 total)
Rep 2: 33,942 unique sequences (120,823 total)
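pre.cluster was run on the filtered, re-uniqued files, along these lines (placeholder filenames):

```
pre.cluster(fasta=rep1.good.pick.filter.unique.fasta, name=rep1.good.pick.filter.names, diffs=2)
```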
I then classified the sequences against the RDP training set and began sequence analysis. This is where I began having issues. I had to run these files on a Linux machine with 16 GB RAM because my MacBook Pro (4 GB RAM) could not handle some of the 4-6 GB distance matrices. I generated a distance matrix for each replicate with the cutoff set to 0.1, then clustered each column-formatted matrix with the average neighbor algorithm and a cutoff of 0.05. In both cases mothur changed my cutoff: to 0.018518 for Rep 1 and 0.02063609 for Rep 2. make.shared reports counts only for the unique and 0.01 labels for Rep 1, and for the unique, 0.01, and 0.02 labels for Rep 2. After classifying the OTUs:
Rep 1 - 7176 OTUs
Rep 2 - 667 OTUs
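The distance and clustering commands were roughly as follows (placeholder filenames; my understanding from earlier forum posts is that cluster lowers the cutoff when it cannot guarantee correct merges beyond that distance, because distances above the dist.seqs cutoff were never written to the sparse column matrix):

```
dist.seqs(fasta=rep1.final.fasta, cutoff=0.1, output=column)
cluster(column=rep1.final.dist, name=rep1.final.names, method=average, cutoff=0.05)
make.shared(list=rep1.final.an.list, group=rep1.final.groups, label=0.03)
classify.otu(list=rep1.final.an.list, name=rep1.final.names, taxonomy=rep1.final.taxonomy, label=0.03)
```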
I would like to work at a 0.03 cutoff, and I am not sure whether the problem is in my commands, in how I processed my sequences, or whether the OTUs at 0.03 are simply no different from those at 0.01 (for Rep 1) and 0.02 (for Rep 2), so that this is all the cluster command will ever output for my dataset. I would be happy to send any of my batch files or output files if you think those would help resolve this issue.