I was wondering whether you could help with an issue I am having with the cluster.split command. I am currently using mothur v1.41.1 to process around 14.4 million 16S V3-4 reads from 85 fish/environmental samples, which were generated using the 341F and 806R primers (460 bp product) and sequenced using V2 chemistry on the Illumina MiSeq. I have been working through the MiSeq SOP and, following the trimming/quality control steps, I now have around 500,000 unique sequences that I would like to cluster before performing the OTU classification step.
My problem is that cluster.split ran for ~14 days and still had not completed. In addition, when it splits the fasta file, the output shows a third column of values labelled “Num_Dists_Below_Cutoff”. I am not sure whether this is a sequence quality issue or a computer memory issue. This column never appeared when I was working through the MiSeq SOP example data, or when I previously ran my own, much smaller dataset (~15.5m 16S V4 reads (292 bp) from 12 samples, ~5,000 final unique sequences). I have tried this command three times, and each time I get the third column and a very long processing time, so I have had to terminate the run. With the smaller dataset this step took only around 1 h. With the current dataset, I never get a .list file or even a .dist file, just a lot (>900) of .fasta.temp files.
Please note that I have skipped the seq.error command for now, as I am waiting on the 16S sequences from the bacterial isolates in my community standard. Instead, I went straight to remove.groups and removed the community standard group from the dataset before running cluster.split. I have yet to run seq.error, list.seqs, cluster and make.shared on the microbiome community standard sample.
I am using the fasta input and the OptiClust method for this command:
cluster.split(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.pick.pick.taxonomy, splitmethod=classify, method=opti, taxlevel=6, cutoff=0.03, processors=8)
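In case it helps with diagnosis, here is a rough Python sketch (not part of mothur) for checking how unevenly the unique sequences are spread across the taxonomy bins. It assumes the usual two-column mothur taxonomy file (sequence name, then a semicolon-separated taxonomy string with confidence scores in parentheses) and assumes that splitmethod=classify with taxlevel=6 groups together sequences sharing the same first six ranks. The function names bin_sizes and pairwise and the sample lines are made up for illustration. The pair count is only an upper bound on the distances each bin could need, not a statement of what mothur actually computes.

```python
import re
from collections import Counter


def bin_sizes(taxonomy_lines, taxlevel=6):
    """Count sequences per taxonomy bin, truncated to `taxlevel` ranks."""
    counts = Counter()
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue
        # Assumed format: <sequence name><whitespace><taxonomy string>
        _name, tax = line.split(None, 1)
        # Drop trailing confidence scores such as "(100)" from each rank.
        ranks = [re.sub(r"\(\d+\)$", "", r) for r in tax.strip().split(";") if r]
        counts[";".join(ranks[:taxlevel])] += 1
    return counts


def pairwise(n):
    """Number of unordered sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical example lines, not taken from the real taxonomy file.
    sample = [
        "seq1\tBacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Streptococcaceae(99);Streptococcus(98);",
        "seq2\tBacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Streptococcaceae(95);Streptococcus(90);",
        "seq3\tBacteria(100);Proteobacteria(100);Gammaproteobacteria(100);Vibrionales(100);Vibrionaceae(100);Vibrio(100);",
    ]
    # Largest bins first, with the upper bound on pairs inside each bin.
    for taxon, n in bin_sizes(sample).most_common():
        print(n, pairwise(n), taxon)
```

To run this on the real data, the sample list could be replaced with the lines of the .taxonomy file passed to cluster.split. If one or two bins hold most of the ~500,000 unique sequences, the pair count for those bins would be very large, which might be relevant to the long run time.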
As I had to terminate mothur, I do not have the original script, but I have uploaded a screenshot in case it helps.
I am not sure whether I have written the command incorrectly or made a mistake in a previous step. As I have already grouped the sequences using pre.cluster, what does the cluster.split command achieve in addition to this? Could I bypass this step? Any help would be greatly appreciated.