I am new to mothur and recently started analysing my MiSeq data using the MiSeq SOP from the mothur website.
I tried it on my smallest data set, and it took my computer (Windows 64-bit, 4 GB) quite a long time to run the commands. The cluster.split command especially took days. After 9 days of calculation, mothur closed with the error message that there was too little memory to run the command. I am now trying only four samples on a 16 GB Windows 64-bit computer, and it's working out quite well. But in this case I have to merge the files for the final analyses.
Here are my two questions:
First: Did I make a mistake that causes the commands to take this long, or is this normal?
Second: Is it wise to merge the shared files, or should I do this after another step?
Thanks a lot! This program and website are really great!
Welcome to the mothur club
Can you give some context for what type of data you have? What region? What type of reads? etc?
Hi and thanks for the warm welcome!
I generated the data on a MiSeq sequencer, using the 300 bp paired-end method, targeting the V1-V3 region of the bacterial 16S rRNA gene.
The run generated around 150,000 sequences in total for each of the samples. The small data set that my computer was not able to analyse has 7 samples; my biggest data set has 33 samples. I tried the workflow with only one sample, and this time it behaved well. So I have now subgrouped the bigger data sets into 4 samples each. Still, the chimera check (I am using uchime) and the cluster.split command take hours to run. The chimera command needs around 3 hours per sample, and cluster.split on a 4-sample subgroup took overnight to split 60,000 sequences. So it will probably take weeks to analyse all the samples this way, because I have more than 80 in total.
Thanks a lot in advance!
So the problem we find when sequencing a region like the V13 (~500 bp) is that even with the 300PE kit the reads do not fully overlap (duh). This results in poor error correction. The upshot is that you generate a ton of artificially unique sequences, since every error creates a new unique. Ultimately, this causes the problems you are seeing. I suspect your distance matrices will be in the 100's of GB, which will never get through the clustering step without crashing. Since you have ~500 bp reads, you might try running pre.cluster with diffs=5. Alternatively, you could run cluster.split with taxlevel=5 (genus). And then there's always just doing a phylotype-based analysis.
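For anyone following along, those two suggestions look roughly like this in mothur (the file names here are placeholders; substitute the fasta, count, and taxonomy files from your own workflow):

```
# placeholder file names - use the outputs from your own pipeline
pre.cluster(fasta=final.fasta, count=final.count_table, diffs=5)

# or split the data at the genus level so each sub-matrix stays small
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=5)
```

With splitmethod=classify, cluster.split only computes distances within each genus-level bin, which keeps the individual distance matrices far smaller than one matrix over the whole data set.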
Thanks for this reply. This is true - the files are huge!
Then I will try with higher differences. What about the idea of splitting the data set into smaller ones and using the merge.files command to merge the shared files?
Thanks a lot!
Unfortunately, you really can't merge shared files, since the OTUs will be different between them.
Unfortunately, I am still having problems with the cluster.split command.
I tried, as suggested, using diffs=5 in the pre.cluster step, and now there are a lot fewer unique sequences. Still, the .dist file is huge, and I am afraid there is something very, very wrong. I then applied the trim.seqs command before running make.contigs. This reduced the number of unique seqs to around 100,000. Even so, that is still so many that the .dist file from the cluster.split command has a size of 150 GB - and of course this step takes a lot of time.
Then I added large=t to the command. This seemed a bit faster, but still…
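For reference, a sketch of the command with that option (file names are placeholders for your own files):

```
# placeholder file names; large=t makes mothur work from the distance
# files on disk rather than trying to hold the whole matrix in RAM
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=5, large=t)
```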
So, here are my questions:
Can so many unique seqs be possible? Is this normal? (I have 11 samples, and each sample has around 200,000 seqs in total.)
Can it be possible that the command takes this long and the .dist file is this huge? The computer is doing the job - it takes time, but it gets done. But I am worried, because from what I have read, it should not be like this.
Any suggestions? Please?
Thanks a lot in advance!
Again, the problem is likely to be the high error rate that you've gotten from sequencing the V13 region, which does not allow you to obtain complete overlap of your reads. You must have complete overlap to get good error reduction. Otherwise, as you are finding, you will have an artificially large number of unique reads. At this point, I'd suggest doing your analysis using the phylotype-based approach.
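The phylotype approach bins sequences by their taxonomic classification instead of by pairwise distances, so no .dist file is ever built. A minimal sketch following the MiSeq SOP (file names are placeholders for your own files):

```
# placeholder file names; bin sequences by taxonomy instead of distance
phylotype(taxonomy=final.taxonomy)

# label=1 corresponds to the genus level
make.shared(list=final.tx.list, count=final.count_table, label=1)
classify.otu(list=final.tx.list, count=final.count_table, taxonomy=final.taxonomy, label=1)
```

From there you can carry on with the usual downstream analyses on the resulting shared file.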