cluster.split

Could anyone explain what is going on here? I get the following message when I run the cluster.split command:

Splitting the file…
ERROR: >01552_62_000000000-AY344_1_1101_11866_1907 is missing from your fastafile. This could happen if your taxonomy file is not unique and your fastafile is, or it could indicate an error.

The indicated sequence is definitely in the fasta file. :evil:


Oops... just realized that my .pick.taxonomy file looks very weird. I hope restoring it will help me out! :D

Let us know if that doesn’t fix the problem for you.
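If it pops up again, one quick way to track down a fasta/taxonomy mismatch (just a sketch; swap the placeholder names below for your actual files) is to list the sequence names in the fasta and then use that accnos file to pull the matching taxonomy entries:

list.seqs(fasta=yourfile.pick.fasta)
get.seqs(accnos=current, taxonomy=yourfile.pick.taxonomy)

(current picks up the accnos file that list.seqs just made.) Comparing the number of entries in the resulting taxonomy file against the number of sequences in the fasta should show whether anything is missing or duplicated.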

Pat

The problem was fixed by remaking the taxonomy file :smiley:

I was wondering if you could provide some guidance on the cluster.split command.

I am having an issue with cluster.split, but I don't think it's a data quality issue; maybe it's a RAM issue? When I run cluster.split I end up with over 300 dist files that eat up over 2 TB of space, but no list file is generated. I ran my mock community through the error analysis and it had an error rate of 0.000265491. I tried cluster.split instead of the classic cluster command because of the size of this new data set. In all my past analyses I have only run cluster, so I'm not sure whether I have done something wrong.

I am running mothur 1.39.4 on a Linux virtual workstation: Red Hat Enterprise Linux Workstation release 6.9 (Santiago), kernel 2.6.32-696.1.1.el6.x86_64. I have 62 GB of RAM, 3 Intel® Xeon® processors, and 5 TB of disk space. Because this is a virtual machine, I have the option of increasing the RAM, number of CPUs, and disk space if any of those is the issue here.

I am processing about 12.5 million sequences from ~370 samples. Our samples were sequenced at Argonne National Laboratory following the Earth Microbiome Project 16S Illumina amplicon protocol. They used the 515F-806R primers to target the V4 region, and the sequencing was 251x251 paired-end. I am following the MiSeq SOP on the mothur wiki.

I used the fasta option to run cluster.split.

cluster.split(fasta=Run2_16S_R1.trim.contigs.pick.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, count=Run2_16S_R1.trim.contigs.pick.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=Run2_16S_R1.trim.contigs.pick.good.unique.good.filter.unique.precluster.pick.nr_v123.wang.pick.pick.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=1)


Instead of getting a list file as output, I end up with 374 dist files (for example, ….fasta.374.dist). The largest of these dist files is 242.2 GB, and 3 others are over 100 GB. Most of the remaining files are less than 10 GB each.

Have I run the command incorrectly or is this a memory issue? Any guidance is greatly appreciated!

I thought Argonne’s EMP sequencing was done on a HiSeq 2x150?

Your largest dist file exceeds the size of your RAM, so it will not process. You have a few options:

1) use the new clustering algorithm (method=opti, cutoff=0.03),
2) pre.cluster your seqs with higher diffs, or
3) run cluster.split with a higher taxlevel (this only works if your big dist files are made up of several groups at taxlevel=5 or 6; most of the time when I've run into dist size issues, increasing the taxlevel hasn't changed anything because the group is unclassified at finer taxlevels).
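To make those concrete, here is roughly what each option could look like (a sketch only; the parameter values are illustrative starting points, and the pre.cluster file names are placeholders, not your actual files):

1) OptiClust, using the same fasta/count/taxonomy as in your command above. With method=opti you only need distances up to the final 0.03 cutoff, so the split matrices should come out much smaller than at 0.15:

cluster.split(fasta=Run2_16S_R1.trim.contigs.pick.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, count=Run2_16S_R1.trim.contigs.pick.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.pick.count_table, taxonomy=Run2_16S_R1.trim.contigs.pick.good.unique.good.filter.unique.precluster.pick.nr_v123.wang.pick.pick.taxonomy, splitmethod=classify, taxlevel=4, method=opti, cutoff=0.03, processors=1)

2) Rerun pre.cluster earlier in the pipeline with a higher diffs value than before (e.g., diffs=3) to collapse more near-identical sequences before clustering; the file names here are stand-ins for whatever your filter/unique output is called:

pre.cluster(fasta=yourfile.good.filter.unique.fasta, count=yourfile.good.filter.count_table, diffs=3)

3) Your original cluster.split command unchanged, except with taxlevel=5 or taxlevel=6 instead of taxlevel=4.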