Cluster.split issue "Num_Dists_Below_Cutoff"

Hi,

I was wondering whether you could provide some assistance with an issue I am having with the cluster.split command. I am using mothur v1.41.1 to process around 14.4 million 16S V3-V4 reads from 85 fish/environmental samples, generated with the 341F and 806R primers (~460 bp product) using V2 chemistry on the Illumina MiSeq. I have been working through the MiSeq SOP, and after the trimming/quality-control steps I now have around 500,000 unique sequences that I would like to cluster before performing the OTU classification step.

I am having an issue with the cluster.split command: it ran for ~14 days and still had not completed. In addition, while it is splitting the fasta file, the screen output shows a third column of values labelled "Num_Dists_Below_Cutoff". I am not sure whether this indicates a sequence quality issue or a computer memory issue. This column never appeared when I was working through the MiSeq SOP example data, or when I ran my own previous dataset, which ended up much smaller (~15.5 million 16S V4 reads (292 bp) from 12 samples, ~5,000 final unique sequences). I have repeated the command three times, and I always get the third column and a processing time so long that I have to terminate the run. With the smaller dataset this step took only around 1 hour; with the current dataset I never get a .list file or even a .dist file, just a large number (>900) of .fasta.temp files.

Please note that I have skipped the seq.error command, as I am still waiting on the 16S sequences from the bacterial isolates in my community standard. For now I have skipped ahead to remove.groups and removed the mock community group from the dataset before running cluster.split. I have yet to run seq.error, list.seqs, cluster, and make.shared on the microbiome community standard sample.
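For reference, the group-removal step looked something like this (the group name "Mock" here is a placeholder for my community standard sample, and I am using mothur's "current" shortcut rather than spelling out my full file names):

remove.groups(fasta=current, count=current, taxonomy=current, groups=Mock)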

I am using the fasta input and the OptiClust method for this command:
cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.pick.pick.taxonomy, splitmethod=classify, method=opti, taxlevel=6, cutoff=0.03, processors=8)

As I had to terminate mothur, I do not have the original output, but I have uploaded a screenshot in case it helps.

I am not sure whether I have written the command incorrectly or made a mistake in a previous step. Since I have already clustered the sequences with pre.cluster, what does cluster.split achieve in addition to this, and could I bypass it? Any help would be greatly appreciated.

Thanks
Chris

Hi,

The third column is fine and is supposed to be there. The problem is that you have a lot of samples, a lot of sequences, and the sequence quality is pretty poor. I'd encourage you to check out this blog post about the effects of sequencing regions that do not fully overlap, as is the case with V3-V4. You might try increasing the number of diffs in pre.cluster from 2 to 4.
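If you rerun from that point, the command would look something like this (using mothur's "current" shortcut as a placeholder for your actual fasta and count files):

pre.cluster(fasta=current, count=current, diffs=4)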

The cluster.split step is necessary to cluster your sequences into OTUs. pre.cluster only merges sequences within a small number of differences to reduce sequencing noise; cluster.split is what actually assigns sequences to OTUs at your chosen cutoff. If you don't do that, you won't have OTUs.
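Once cluster.split finishes, the usual downstream steps look along these lines (again with "current" standing in for your actual .list, count, and taxonomy files):

make.shared(list=current, count=current, label=0.03)
classify.otu(list=current, count=current, taxonomy=current, label=0.03)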

Pat

Hi Pat,

Thanks for your speedy response.

OK, I think the third column just threw me off, as I had never seen it before when running the command. In pre.cluster I used 5 as my number of differences, which took my unique sequences down from 1.7 million to 561,000. Does this seem about right?

Would it be possible to split my dataset into sampling groups, run cluster.split separately on each one, and then somehow (if this is even possible) merge the .list files at the end before performing classify.otu?

Thanks
Chris

No, you really can’t cluster each sample separately. I suspect the problem is that you have data with a high sequencing error rate that isn’t getting adequately denoised because the reads don’t fully overlap.

Pat
