I’ve been stuck for a couple of months trying to cluster a large dataset. The sequenced region is V3-V4, without complete overlap and with likely high sequencing error. I’ve mainly been following the steps in the MiSeq protocol. The original number of sequences from about two dozen samples totaled close to 4 million, with around 450k unique sequences. I was able to create a column distance matrix (nearly 50 GB), but I have yet to cluster the sequences, and my errors are not reproducible. I’m using an AWS machine and have been scaling up memory with no luck.
What I’ve tried:
- cluster.split with splitmethod=distance, cutoff=0.03, large=T (this failed regardless of memory, anywhere from 32 to 256 GB)
- cluster.split with splitmethod=classify, taxlevel=4, cutoff=0.03, large=T, cluster=F
followed by:
cluster.split(file=, processors ranging from 1-64)
I found that with 1 processor the program threw multiple errors stating:
“[ERROR]: Your count table contains more than 1 sequence named , sequence names must be unique. Please correct.”
As I increased the number of processors, the error regarding the count files disappeared and instead I got multiple errors along the lines of:
“cannot open xyz1.dist.temp”
“clustering xyz1.dist.temp”
“cannot open xyz2.dist.temp”
“clustering xyz2.dist.temp”
Finally, it reported that it had exceeded the allowable number of errors and quit. It looks as though it was having trouble loading each distance file into memory for the cluster step, but then it would recover and actually cluster the file. Is that what is happening?
- cluster.split with splitmethod=classify, taxlevel=6, cutoff=0.03, cluster=F
With 16 cores and 64 GB of memory this had not finished after more than a day, at which point I killed it.
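Spelled out, the two-step version of what I ran looks roughly like this (file names are placeholders, and I’m going from memory on the exact parameter spelling):

```
cluster.split(column=my.dist, count=my.count_table, taxonomy=my.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03, large=T, cluster=F)
cluster.split(file=my.file, processors=16)
```

My understanding is that the cluster=F run writes per-group distance files plus a .file listing them, and the second call is supposed to pick those up and do only the clustering.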
Finally, I got different errors running the same analysis on v1.39.5 and v1.44.1. With v1.39.5 (which I had previously used successfully on a smaller dataset), no matter the amount of memory or the number of processors, it threw error after error stating:
“[ERROR]: Your count table contains more than 1 sequence named , sequence names must be unique. Please correct.”
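Since the error complains about duplicate (here, apparently empty) names, I put together a quick check outside mothur to see whether the count table itself is malformed. This is a sketch assuming the plain, uncompressed tab-separated count_table layout (a header line, then one row per sequence whose first field is its name); `duplicate_names` is just an illustrative helper, not a mothur command:

```python
from collections import Counter

def duplicate_names(count_table_path):
    """Return sequence names that appear more than once in a count table.

    Assumes the plain tab-separated count_table layout: a header line,
    then one row per sequence whose first field is the sequence name.
    """
    with open(count_table_path) as fh:
        fh.readline()  # skip the "Representative_Sequence  total  ..." header
        names = Counter(line.split("\t", 1)[0] for line in fh if line.strip())
    return [name for name, n in names.items() if n > 1]
```

If this returns anything (or returns an empty string as a "name"), the table really is broken before mothur ever sees it.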
I am really curious how the cluster.split command works and what to be aware of when deploying it. First, is it better to cluster from a fasta file or from a distance matrix? My assumption is that if you supply a fasta, the next step is to calculate pairwise distances, so a premade distance matrix should cut down on analysis time. Or am I wrong?
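For concreteness, these are the two starting points I mean (placeholder file names; I believe cluster.split accepts either a fasta or a premade column-format matrix):

```
cluster.split(fasta=my.fasta, count=my.count_table, taxonomy=my.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03)
cluster.split(column=my.dist, count=my.count_table, taxonomy=my.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03)
```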
Second, why does clustering take longer at taxlevel=6 than at taxlevel=4? I assumed that smaller groups would mean faster clustering, but there are also more groups, so perhaps the total analysis time ends up longer?
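My back-of-the-envelope reasoning, for what it’s worth: within-group work scales with the number of pairwise comparisons, n(n-1)/2 per group, so a finer split should mean fewer total comparisons, and any extra time must be coming from per-group overhead (splitting, file I/O, scheduling). A sketch with made-up, evenly sized groups:

```python
def total_pairs(group_sizes):
    """Total within-group pairwise comparisons for a given split."""
    return sum(n * (n - 1) // 2 for n in group_sizes)

# 450k uniques, split two ways (made-up, evenly sized groups):
coarse = [45_000] * 10   # fewer, larger groups (think taxlevel=4)
fine = [4_500] * 100     # many smaller groups (think taxlevel=6)

print(total_pairs(coarse))  # 10124775000
print(total_pairs(fine))    # 1012275000
```

By raw comparison counts the finer split is about 10x cheaper, which is exactly why the taxlevel=6 slowdown surprises me.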
I mainly work with environmental samples with high diversity, and as sequencing costs go down, sampling intensity goes up and the number of unique sequences increases. I completely understand the need for complete overlap to control for sequencing error with Illumina technology, but it’s easy to foresee that as sampling intensity increases you may legitimately have a million or more unique sequences even with complete overlap of the V4 region.
Lastly, I fully realize that Windows is not ideal for this, and that’s why everyone uses Linux. I have a personal Windows machine (don’t hate), and using PuTTY to ssh into a Linux server has been a pain. I swear this is the last time I try to do this on a Windows machine! But now that I’ve proclaimed mea culpa: I would like to keep the taxonomy, distance matrix, and count table I have already generated, convert them to Linux format, and try to run them on the mothur AWS image. Can I trust the files I’ve already created? Are there any utilities you trust for the conversion? I’m paying for the analysis out of pocket, so I can’t entertain rerunning the whole analysis on Linux (although perhaps I’m a victim of the sunk cost fallacy).
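On the conversion question specifically: dos2unix is the standard tool on the Linux side, but since the only difference should be CRLF vs LF line endings, the conversion is simple enough to script (a sketch; it streams in binary mode so nothing besides line endings is touched, which matters for a 50 GB matrix):

```python
def to_unix_line_endings(src_path, dst_path):
    """Copy a text file, converting Windows CRLF line endings to Unix LF.

    Reads and writes in binary mode, one line at a time, so a very large
    file streams through without being loaded into memory and without any
    newline re-interpretation by Python's text mode.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # iterating a binary file still yields lines
            dst.write(line.replace(b"\r\n", b"\n"))
```

Usage would be something like `to_unix_line_endings("my.dist", "my.unix.dist")` — the file names here are placeholders for my actual outputs.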