To speed up processing and avoid oversized distance matrices, I’ve taken to splitting my analyses (MiSeq v4) into smaller, more easily digested chunks.
This works quite well, but it creates a problem further down the pipeline: I can’t use any of the nifty statistical tools mothur offers on an entire dataset, because those all start from a .shared file, which is currently in 4-6 pieces depending on the dataset.
Is there a tool/command that I’m just not seeing for merging .shared and/or .list files, similar to merge.taxsummary, which would allow me to put these back together?
Or, if this is a bad and/or stupid idea, please tell me why…
You can’t compare OTUs between different clustering jobs. I run the SOP through chimera checking on subsets of the data (including pre-clustering!), then concatenate my fasta and count files before running cluster.split. If you’re still running into huge dist files, you could set your taxlevel to 5 or even 6 (though at 6 you will likely not get any clusters above 3%).
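A minimal sketch of the concatenation step, assuming two chunks named chunk1/chunk2 (the file names here are hypothetical stand-ins for your own prefixes). FASTA files are plain text, so simple concatenation works; count tables carry a header line and per-sample totals, so they should be merged with mothur rather than cat’d:

```shell
# Demo with two tiny per-chunk FASTA files (stand-ins for the real chunks):
printf '>seq1\nACGTACGT\n' > chunk1.fasta
printf '>seq2\nTTGGCCAA\n' > chunk2.fasta

# FASTA is plain text, so the chunks can simply be concatenated:
cat chunk1.fasta chunk2.fasta > combined.fasta
grep -c '^>' combined.fasta   # prints 2: both sequences are in the merged file

# Count tables have a header row, so do NOT cat them; mothur's merge.count
# command (syntax assumed here -- check your mothur version's docs) merges them:
# mothur "#merge.count(count=chunk1.count_table-chunk2.count_table, output=combined.count_table)"
```

The combined fasta and count table can then be fed to cluster.split as usual.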
Thanks! I was afraid that there would be issues with clustering OTUs between jobs. …
I’ll try combining the .fasta and .count files and give it a whirl.
Hi! I know this post was made in 2016 however I am new to the field and running into issues with large distance matrices.
I am curious what your cluster.split command looked like when you concatenated your fasta and count files.
Thanks for any help you can give!
What are your samples? How were they sequenced?
I have 26 samples that were sequenced with Illumina HiSeq sequencing.
The fasta file I am trying to get through cluster.split is 11.31 GB. I keep getting a “Killed: 9” error when I run cluster.split, and it doesn’t seem to be memory, as I have over a TB of space left when it crashes.
I am following the MiSeq SOP wanting to analyze based on OTUs.
Have you read this? http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/
I’d try opticlust rather than cluster.split, and very few processors. You mention over 1 TB of space, but that sounds like hard drive. The issue is RAM, not hard-drive space: to cluster an 11.31 GB matrix, you must have more than 11.31 GB of RAM for each processor.
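To see why RAM runs out so quickly, a back-of-the-envelope sketch of how a pairwise distance matrix scales (the bytes-per-entry figure is a rough assumption; the on-disk column format is larger still, since it also stores sequence names on every line):

```python
def pairwise_entries(n_seqs: int) -> int:
    # A lower-triangle distance matrix holds one entry per unique pair.
    return n_seqs * (n_seqs - 1) // 2

def approx_matrix_gib(n_seqs: int, bytes_per_entry: int = 8) -> float:
    # bytes_per_entry = 8 assumes one double-precision value per distance;
    # this is a lower bound on what clustering must hold in memory.
    return pairwise_entries(n_seqs) * bytes_per_entry / 2**30

# 100,000 unique sequences -> ~5 billion pairs -> tens of GiB of RAM,
# which is why reducing unique sequences (pre-clustering, splitting by
# taxonomy) matters far more than free disk space.
print(f"{approx_matrix_gib(100_000):.1f} GiB")
```

The pair count grows quadratically, so doubling the number of unique sequences roughly quadruples the memory needed.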
This is a bit of a blast from the past…
Between improvements to mothur (opti clustering, especially) and improvements in processing power on my end, this became something of a non-issue for me.
My call for cluster.split is:
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=5, cutoff=0.03, method=opti, processors=128)
This had no issues handling a MiSeq (250 bp 16S) dataset of 88 samples, averaging 100k reads/sample.
Thank you both for your help. I will keep trying these things out!