Error message when doing cluster.split

pdlnn · October 14, 2014, 12:44am

Hello everyone,

I am working on my Illumina MiSeq data following the MiSeq SOP and I am now at the cluster.split step. I tried the traditional dist.seqs and cluster commands, but they didn’t work because I have too many unique sequences. I’ve read the article written by Dr. Schloss and understand that my large distance matrix is probably due to a high error rate in my reads. But I have neither the time nor the money for another MiSeq run, so I have to get on with the data I currently have. Anyway, I am now at the cluster.split step using the command:

cluster.split(fasta=, count=, taxonomy=, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=40)

and I keep getting thousands of error messages like the ones below:
[ERROR]: Your count table contains a sequence named with a total=0. Please correct.
[ERROR]: Your count table contains more than 1 sequence named , sequence names must be unique. Please correct.

I’ve checked the count file and I’m sure all sequences have unique names and no sequence has total=0. Can someone please tell me what’s going on?
BTW, I don’t know if this makes any difference but I am using something with 80 processors and 1TB memory, at least that’s what I’ve been told.

Thanks!
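For anyone who wants to repeat the check described above (unique names, no total=0) outside of mothur, here is a minimal Python sketch. It assumes the count table is tab-delimited, with one header row followed by rows whose first column is the sequence name and whose second column is the total; lines starting with "#" are skipped. The function name check_count_table and the keys of the returned dictionary are made up for this illustration and are not part of mothur.

```python
from collections import Counter


def check_count_table(lines):
    """Scan count-table lines for the problems named in the error messages.

    Assumed layout (not verified against any particular mothur version):
    tab-delimited, one header row, then name<TAB>total<TAB>per-group counts.
    """
    names = Counter()
    zero_total = []   # names whose total column is 0
    blank_names = []  # line numbers whose name column is empty
    header_seen = False

    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Skip empty lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The first remaining line is treated as the header row.
        if not header_seen:
            header_seen = True
            continue

        fields = line.split("\t")
        name = fields[0].strip()
        if not name:
            blank_names.append(lineno)
            continue

        names[name] += 1
        try:
            total = int(fields[1])
        except (IndexError, ValueError):
            total = 0  # a missing or unreadable total is reported as zero
        if total == 0:
            zero_total.append(name)

    duplicates = sorted(n for n, c in names.items() if c > 1)
    return {
        "duplicates": duplicates,
        "zero_total": zero_total,
        "blank_names": blank_names,
    }
```

To run it on a real file, pass an open file handle, for example check_count_table(open(path)) where path points at your own count file. If all three lists come back empty, the file itself is probably fine and the errors are more likely coming from how it is being read during the run.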

pdlnn · October 14, 2014, 12:47am

BTW, my sequence names look something like
M01936_54_000000000-A7D92_1_1101_21303_2664

Not sure if there’s something wrong with the names.

Thanks.

pdlnn · October 14, 2014, 5:50pm

Desperately need help here. I need results to finish my dissertation, but the analysis hasn’t gotten me anywhere. At least not yet : (

westcott · October 14, 2014, 7:11pm

You likely have an issue in the distance matrix due to its size. You may be able to avoid the error by setting processors=1. The fewer processors you use, the less memory you will need. The phylotype command may be your only option.

pdlnn · October 14, 2014, 7:31pm

Thanks a lot Westcott! Reducing the processors to 1 will decrease the memory usage but dramatically increase the processing time (compared to processors=40), is that right? Also, following the MiSeq pipeline I ended up with 270k unique sequences. If I set cutoff=0.15 and output=lt, dist.seqs will generate a distance matrix of about 240GB. Is this too big to process?
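To see where a figure of that order comes from, here is a rough back-of-the-envelope sketch in Python. A lower-triangle matrix for n sequences holds n(n-1)/2 pairwise distances. The 7 bytes per distance (something like "0.1234" plus a separator) is an assumption made for this illustration, the space taken by the sequence names is ignored, and the sketch does not model any effect of the cutoff. The function names are hypothetical.

```python
def pair_count(n_seqs):
    """Number of pairwise distances in a lower-triangle matrix."""
    return n_seqs * (n_seqs - 1) // 2


def lt_matrix_size_bytes(n_seqs, bytes_per_distance=7):
    """Rough text-file size, assuming a fixed number of bytes per distance.

    bytes_per_distance=7 is an assumption, not a documented mothur value.
    """
    return pair_count(n_seqs) * bytes_per_distance


n = 270_000
print(pair_count(n))                       # 36449865000 pairs
print(lt_matrix_size_bytes(n) / 1024**3)   # size in GiB under the 7-byte assumption
```

Under that assumption, 270k unique sequences give about 36.4 billion distances and a file in the low hundreds of gigabytes, which is in line with the estimate above. Because the pair count grows with the square of the number of sequences, halving the number of unique sequences cuts the matrix to roughly a quarter of the size.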

pdlnn · October 14, 2014, 7:47pm

Another question. And this one might be really stupid. If my data is too large for Mothur to handle, will QIIME be able to do the job? I prefer to use Mothur, and I believe that if bad data is the reason why Mothur has difficulties processing the job, QIIME will likely have problems doing it as well; otherwise everybody would choose QIIME over Mothur. But my adviser keeps asking me “have you tried using QIIME?” “why don’t you use QIIME?”, which is really annoying. I just hope I can hear someone from this forum say “don’t waste your time using QIIME, if large data was the reason, QIIME will fail too” so that I can tell my adviser I will stick with Mothur.

pschloss · October 20, 2014, 9:34pm

If QIIME gets it to work it is because they are using a glorified phylotype approach. In addition to what Sarah mentioned, you could try using the classic=T option in cluster.split. It actually sounds like you (and your advisor) need to take a look at this…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/
