To speed up processing and avoid oversized distance matrices, I’ve taken to splitting my analyses (MiSeq v4) into smaller, more easily digested chunks.
This works quite well, but it creates a problem further down the pipeline: I can’t use any of the nifty statistical tools mothur offers on an entire dataset, because those all start from a .shared file, which is currently in 4-6 pieces depending on the dataset.
Is there a tool/command that I’m just not seeing for merging .shared and/or .list files, similar to merge.taxsummary, which would allow me to put these back together?
Or, if this is a bad and/or stupid idea, please tell me why…
You can’t compare OTUs between different clustering jobs. I run the SOP through chimera checking on subsets of the data (including pre-clustering!), then concatenate my fasta and count files before running cluster.split. If you’re still running into huge dist files, you could set your taxlevel to 5 or even 6 (though at 6 you will likely not get any clusters above 3%).
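A minimal sketch of the concatenation step, assuming two chunks named chunk1/chunk2 (the file names here are hypothetical stand-ins for your own prefixes). FASTA files are plain text, so simple concatenation works; count tables carry a header line and per-sample totals, so they should be merged with mothur rather than cat’d:

```shell
# Demo with two tiny per-chunk FASTA files (stand-ins for the real chunks):
printf '>seq1\nACGTACGT\n' > chunk1.fasta
printf '>seq2\nTTGGCCAA\n' > chunk2.fasta

# FASTA is plain text, so the chunks can simply be concatenated:
cat chunk1.fasta chunk2.fasta > combined.fasta
grep -c '^>' combined.fasta   # prints 2: both sequences are in the merged file

# Count tables have a header row, so do NOT cat them; mothur's merge.count
# command (syntax assumed here -- check your mothur version's docs) merges them:
# mothur "#merge.count(count=chunk1.count_table-chunk2.count_table, output=combined.count_table)"
```

The combined fasta and count table can then be fed to cluster.split as usual.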
Thanks! I was afraid that there would be issues with clustering OTUs between jobs. …
I’ll try combining the .fasta and .count files and give it a whirl.
Hi! I know this post was made in 2016 however I am new to the field and running into issues with large distance matrices.
I am curious what your cluster.split command looked like when you concatenated your fasta and count files.
Thanks for any help you can give!
What are your samples? How were they sequenced?
I have 26 samples that were sequenced with Illumina HiSeq sequencing.
The fasta file I am trying to get through cluster.split is 11.31 GB. I keep getting a “Killed: 9” error when I run cluster.split, and it doesn’t seem to be memory, as I have over a TB of space left when it crashes.
I am following the MiSeq SOP wanting to analyze based on OTUs.
Have you read this? http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/
I’d try opticlust rather than cluster.split, and very few processors. You mention over 1 TB of space, but that sounds like hard drive. The issue is RAM, not hard-drive space: to cluster an 11.31 GB matrix, you must have more than 11.31 GB of RAM for each processor.
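To see why RAM runs out so quickly, a back-of-the-envelope sketch of how a pairwise distance matrix scales (the bytes-per-entry figure is a rough assumption; the on-disk column format is larger still, since it also stores sequence names on every line):

```python
def pairwise_entries(n_seqs: int) -> int:
    # A lower-triangle distance matrix holds one entry per unique pair.
    return n_seqs * (n_seqs - 1) // 2

def approx_matrix_gib(n_seqs: int, bytes_per_entry: int = 8) -> float:
    # bytes_per_entry = 8 assumes one double-precision value per distance;
    # this is a lower bound on what clustering must hold in memory.
    return pairwise_entries(n_seqs) * bytes_per_entry / 2**30

# 100,000 unique sequences -> ~5 billion pairs -> tens of GiB of RAM,
# which is why reducing unique sequences (pre-clustering, splitting by
# taxonomy) matters far more than free disk space.
print(f"{approx_matrix_gib(100_000):.1f} GiB")
```

The pair count grows quadratically, so doubling the number of unique sequences roughly quadruples the memory needed.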
This is a bit of a blast from the past…
Between improvements to mothur (opti clustering, especially) and improvements in processing power on my end, this became something of a non-issue for me.
My call for cluster.split is:
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=5, cutoff=0.03, method=opti, processors=128)
This had no issues handling a MiSeq (250 bp 16S) dataset of 88 samples, averaging 100k reads/sample.
Thank you both for your help. I will keep trying these things out!