I am working on my Illumina MiSeq data following the MiSeq SOP and I am now at the cluster.split step. I tried the traditional dist.seqs and cluster commands, but they didn’t work because I have too many unique sequences. I’ve read the article written by Dr. Schloss and understand that my large distance matrix is probably due to a high error rate in my reads. But I have no time or money for another MiSeq run, so I have to get on with the data I currently have. Anyway, I am now at the cluster.split step using the command:
cluster.split(fasta=, count=, taxonomy=, splitmethod=classify, taxlevel=4, cutoff=0.15, processors=40)
and I keep getting thousands of error messages like the ones below:
[ERROR]: Your count table contains a sequence named with a total=0. Please correct.
[ERROR]: Your count table contains more than 1 sequence named , sequence names must be unique. Please correct.
I’ve checked the count file and I’m sure all sequences have unique names and no sequence has total=0. Can someone please tell me what’s going on?
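For anyone who wants to repeat that check outside of mothur, here is a minimal Python sketch that scans a count table for the two conditions named in the error messages. It assumes a tab-delimited file whose first non-comment line is a header and whose first two columns are the sequence name and the total; the function name `check_count_table` and the treatment of lines starting with `#` as comments are my own assumptions, not anything mothur provides.

```python
from collections import Counter


def check_count_table(lines):
    """Scan count-table lines; return (duplicates, bad_totals, blank_name_rows).

    Assumes tab-delimited rows: name, total, then per-group counts.
    """
    seen = Counter()
    bad_totals = []       # names whose total is 0, missing, or not an integer
    blank_name_rows = []  # 1-based line numbers where the name field is empty
    header_seen = False

    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and (assumed) comment lines
        if not header_seen:
            header_seen = True  # first remaining line is assumed to be the header
            continue

        fields = line.split("\t")
        name = fields[0].strip()
        if not name:
            blank_name_rows.append(lineno)
            continue

        seen[name] += 1
        try:
            total = int(fields[1])
        except (IndexError, ValueError):
            total = 0  # a missing or unparsable total is reported with the zeros
        if total == 0:
            bad_totals.append(name)

    duplicates = sorted(n for n, c in seen.items() if c > 1)
    return duplicates, bad_totals, blank_name_rows
```

To run it on a real file, something like `with open("my.count_table") as fh: print(check_count_table(fh))` should do, where `my.count_table` is a placeholder for your own file name. If all three lists come back empty, the file on disk is probably fine and the problem is more likely happening while the command runs.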
BTW, I don’t know if this makes any difference but I am using something with 80 processors and 1TB memory, at least that’s what I’ve been told.
BTW the names of my sequences are something like
Not sure if there’s something wrong with the names.
Desperately need help here. Need results to finish my dissertation but the analysis didn’t get me anywhere. At least not yet : (
You likely have an issue in the distance matrix due to its size. You may be able to avoid the error by setting processors=1. The fewer processors you use, the less memory you will need. The phylotype command may be your only option.
Thanks a lot Westcott! Reducing the processors to 1 will decrease the memory usage but dramatically increase the processing time (compared to processors=40), is that right? Also, following the MiSeq pipeline I ended up with 270k unique sequences. If I set cutoff=0.15 and output=lt, dist.seqs will generate a distance matrix of about 240GB. Is this too big to process?
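To put that size in context, a lower-triangle matrix holds one distance for every pair of unique sequences, so the count grows with the square of the number of sequences. The sketch below is only back-of-the-envelope arithmetic: the 7 bytes per stored distance is my own rough guess for a short decimal number plus a separator in a text file, not a figure from mothur, and the function names are made up for illustration.

```python
def pairwise_count(n_unique):
    """Number of distinct sequence pairs, i.e. entries in a lower triangle."""
    return n_unique * (n_unique - 1) // 2


def estimated_size_gb(n_unique, bytes_per_distance=7):
    """Rough file size in decimal gigabytes.

    bytes_per_distance is an assumption (text distance plus separator).
    """
    return pairwise_count(n_unique) * bytes_per_distance / 1e9


n = 270_000  # unique sequences reported in the post above
print("pairs:", pairwise_count(n))
print("approx. size in GB:", round(estimated_size_gb(n)))
```

Under that assumption the estimate lands in the same range as the 240GB quoted above. Because the pair count scales with n squared, halving the number of unique sequences would cut the matrix to roughly a quarter of the size.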
Another question, and this one might be really stupid. If my data is too large for Mothur to handle, will QIIME be able to do the job? I prefer to use Mothur, and I believe that if bad data is the reason why Mothur has difficulty processing the job, QIIME is likely to have problems with it as well; otherwise everybody would choose QIIME over Mothur. But my adviser keeps asking me “have you tried using QIIME?” and “why don’t you use QIIME?”, which is really annoying. I just hope I can hear someone from this forum say “don’t waste your time using QIIME, if large data was the reason, QIIME will fail too” so that I can tell my adviser I will stick with Mothur.
If QIIME gets it to work, it is because they are using a glorified phylotype approach. In addition to what Sarah mentioned, you could try using the classic=T option in cluster.split. It actually sounds like you (and your advisor) need to take a look at this…