Successful use of cluster.split with Windows?

I’ve been stuck for a couple of months trying to cluster a large data set. The sequenced region is V3-V4 without complete overlap and likely high sequencing error. I’ve been mainly following the steps in the MiSeq protocol. The original number of sequences from about two dozen samples totaled close to 4 million, with unique sequences around 450k. I was able to create a column distance matrix but have yet to be able to cluster the sequences and my errors are not reproducible. The distance matrix is nearly 50 GB. I’m using an AWS machine and have been scaling up memory with no luck.

What I’ve tried:

  1. cluster.split, splitting by distance, with cutoff=0.03 and large=T (this did not work regardless of memory, anywhere from 32 to 256 GB)
  2. cluster.split, splitting by taxonomy, with taxlevel=4, cutoff=0.03, large=T, and cluster=F,
    followed by:
    cluster.split(file=, with processors ranging from 1 to 64)

I found that with 1 processor the program threw multiple errors stating:
“[ERROR]: Your count table contains more than 1 sequence named , sequence names must be unique. Please correct.”
As I increased the number of processors, the error regarding the count files disappeared and instead I got multiple errors along the lines of:
“cannot open xyz1.dist.temp”
“clustering xyz1.dist.temp”
“cannot open xyz2.dist.temp”
“clustering xyz2.dist.temp”
Finally, it stated that it exceeded the allowable number of errors and quit. It seems that it was having trouble loading a distance file into memory for the cluster step, but then it would overcome this and actually cluster the file. Is this what is happening?

  3. cluster.split, splitting by taxonomy, with taxlevel=6, cutoff=0.03, and cluster=F
    With 16 cores and 64 GB of memory this ran for over a day without finishing, at which point I pulled it.

Finally, I got different errors running the same analysis on v1.39.5 and v1.44.1. With v1.39.5 (which I have previously used successfully with a smaller dataset) no matter the amount of memory or the number of processors, it just threw error after error stating:

“[ERROR]: Your count table contains more than 1 sequence named , sequence names must be unique. Please correct.”
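Whenever that error comes up, one quick sanity check is to scan the count table's name column for duplicates or blanks before blaming mothur itself. A minimal sketch using a tiny made-up table (real count tables are tab-delimited with a header line, which is skipped here; substitute your actual file):

```shell
# Build a tiny example count table; the duplicated name "seqA" stands in
# for whatever the real file might contain.
printf 'Representative_Sequence\ttotal\nseqA\t5\nseqB\t3\nseqA\t2\n' > sample.count_table

# Skip the header, take the name column, and print any name seen more than once.
tail -n +2 sample.count_table | cut -f1 | sort | uniq -d > dup_names.txt
cat dup_names.txt

# Count blank names; an empty name would also trigger the "unique" error.
tail -n +2 sample.count_table | cut -f1 | grep -c '^$' || true
```

If this prints any names for the real table, the duplication was introduced upstream and probably needs to be fixed there before clustering.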

I am really curious how the cluster.split command works and what to watch out for when deploying it. First, is it better to cluster from a fasta file rather than a distance matrix? My assumption was that if you start from a fasta file the next step is to calculate pairwise distances anyway, so a premade distance matrix should cut down on analysis time. Perhaps I am wrong?

Secondly, why does it take more time to cluster at taxlevel=6 than at taxlevel=4? I had assumed smaller groups would mean faster clustering, but perhaps the larger number of groups makes the total analysis time longer?

I mainly work with environmental samples with high diversity, and as sequencing costs go down, sampling intensity goes up and the number of unique sequences increases. I completely understand the need for complete overlap to account for sequencing errors with Illumina technology, but as sampling intensity increases you may legitimately end up with a million or more unique sequences even with complete overlap of the V4 region.

Lastly, I fully realize that Windows is not ideal for this and that’s why everyone uses Linux. I have a personal Windows machine (don’t hate), and using PuTTY to ssh into a Linux server has been a pain. I swear this is the last time I try to do this on a Windows machine! Now that I’ve proclaimed mea culpa: I would like to keep the taxonomy, distance matrix, and count table I have already generated, convert them to Linux format, and run the rest on the Mothur AWS AMI. Can I trust the files I’ve already created? Are there any utilities you trust for this conversion? I’m paying for the analysis out of pocket, so I can’t entertain rerunning the whole analysis on Linux (although perhaps I’m a victim of the sunk cost fallacy).

Hi Lisa,

A few things…

  1. The method behind the large argument in cluster is garbage and shouldn’t be used.
  2. You should use mothur v1.44.2 rather than v1.39.5.

Ideally, cluster.split works by splitting the fasta file into smaller fasta files based on their taxonomy. It then calculates the distances between sequences within each fasta file (yes, this should be faster than generating the full distance matrix and then splitting). With each distance matrix, it then runs opticlust to form OTUs at the level specified by cutoff. If you do cluster=F, then it will calculate the distances, but not form OTUs. The benefit of this is that you can use a bunch of processors to generate the distance matrices, but then only use a few processors to cluster since each taxon will be put on a separate processor and you could swamp out your RAM if you have too many clustering procedures going at once. Do you have data showing taxlevel=6 runs slower than taxlevel=4?
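Concretely, that two-pass pattern might look something like the following sketch; file names and processor counts are placeholders, and parameter spellings should be checked against the mothur wiki:

```
# pass 1: split by taxonomy and compute per-group distance matrices
# using many processors; cluster=F skips OTU formation for now
cluster.split(fasta=final.fasta, count=final.count_table, taxonomy=final.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03, cluster=F, processors=32)

# pass 2: cluster from the file listing the per-group matrices, with far
# fewer processors so several large matrices are not in RAM at once
cluster.split(file=final.file, processors=2)
```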

The files should work on any operating system since they’re all text files. I probably wouldn’t use the mothur AMI, since its version of mothur hasn’t been updated in a while and I’ve been too swamped to update it. Regardless, you should be able to install mothur yourself on a Linux AWS instance.

Hope this all helps a bit…

Thanks, that really helps a lot!

I actually have just been using a community Windows AMI and uploading the stuff I need. I truly appreciate how much faster Mothur runs on Linux, but I hate transferring files between Windows and Linux. I think I will consider investing in a Linux laptop just for interfacing with a Linux AWS.

My thinking on cluster.split with a distance file was completely wrong, and I really appreciate the explanation. I had thought that using the distance file would let the pairwise distances that were already calculated be reused for forming OTUs. As for the data on it running longer: I pulled the operation because it was just too expensive. I bought a bunch more memory so I could forgo the large=T option. When I ran this at taxlevel=4 with cluster=F it took about 21 hours to complete, but I couldn’t get the clustering step to work without it throwing too many errors and kicking me out of the program. So I repeated cluster.split with taxlevel=6 and cluster=F, but killed it after about 24 hours.

As for the text files working on any operating system without conversion, I have not found that to be true. I unfortunately ran into this before when using the AWS Mothur AMI to cluster into OTUs and then trying to use the resulting files for some multivariate work in my Windows version of Mothur. From what I read, there are Unix utilities called dos2unix and unix2dos that can convert the text files, so it shouldn’t be too hairy. But it sounds like only the clustering output could be suspect and the fasta, taxonomy, and count tables are fine, so I can try rerunning this on a Linux system with cluster.split, making sure I don’t use large=T.
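For the line-ending issue specifically, even without dos2unix installed, stripping Windows carriage returns is a one-liner. A sketch on a tiny sample file (the real targets would be the dist, count, and taxonomy files):

```shell
# Write a sample line the way Windows would save it (CRLF at the end).
printf 'seqA\tseqB\t0.01\r\n' > sample.dist

# Delete every carriage return, leaving plain LF endings, then swap
# the converted file into place.
tr -d '\r' < sample.dist > sample.dist.unix
mv sample.dist.unix sample.dist
```

The reverse conversion (adding carriage returns back) is rarely needed, since most modern Windows tools read LF-only files without trouble.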

Thanks again!
