Making OTUs without distance matrix

mafernandez · September 19, 2019, 11:33am

Hi there.
I am following the MiSeq SOP with some variations in the pipeline to better suit my needs, and I keep hitting the same problem: my distance matrix is too big to fit into my computer's RAM, so cluster() and even cluster.split() fail.
Following the suggestions given here in response to my initial question:

I tried an approximation based on taxonomy instead of distance, so that I could run cluster.split with the ‘file=’ option. However, that is also taking very long…
Therefore, I decided to approximate my data without using OTUs, using only the taxonomic information from ‘classify.seqs’. My only problem would appear if I need to calculate a richness index, but I think I could solve that.

And here is my question: if you follow the MiSeq SOP, you reach the ‘pre.cluster’ step, where you ‘merge’ sequences that differ by 1 nt per 100 bp. If, after doing this, you repeat the command but allow a difference of 3 nt per 100 bp, would that be equivalent to making OTUs at cutoff=0.03 using a distance matrix?
Would it be correct to treat these output sequences as OTUs for subsequent analyses?

Thanks

pschloss · September 19, 2019, 1:31pm

If you do 3/100 then you’re really making 6% OTUs, since one sequence could be 3 off of two other sequences, and those two sequences could then be as many as 6 off of each other.
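
The chaining argument above can be checked with a small sketch. This is only an illustration in Python, not mothur code: the `hamming` and `mutate` helpers and the three 100 bp toy sequences are made up for the example, and it assumes the sequences are already aligned to the same length.

```python
def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mutate(seq, positions):
    """Return a copy of seq with one substitution at each given position."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


# Hypothetical 100 bp "centre" sequence and two neighbours, each 3 nt away
# from it, but with the differences at different positions.
centre = "ACGT" * 25
left = mutate(centre, [0, 1, 2])
right = mutate(centre, [50, 51, 52])

d_left = hamming(left, centre)    # differences between left and centre
d_right = hamming(right, centre)  # differences between right and centre
d_ends = hamming(left, right)     # differences between the two neighbours

print(d_left, d_right, d_ends)  # prints: 3 3 6
```

Both neighbours are within 3/100 of the centre sequence, so both would be merged into it, yet they are 6/100 apart from each other.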

Are you using the latest version of mothur? We have put in some new algorithms that use a smaller distance matrix and use less RAM.

Pat

mafernandez · September 19, 2019, 1:39pm

I see…

So, if I set pre.cluster to 1 nt per 100 bp, is that like having 98% similarity OTUs? That would also work…

Yes, I am using mothur v1.42.3 on a Windows server with 16 processors and 64 GB of RAM, but my .dist files are over 70 GB, so I think this is the main problem.

Thanks

pschloss · September 19, 2019, 2:05pm

Are you using cutoff=0.03 in dist.seqs?

mafernandez · September 19, 2019, 2:06pm

This is the command line I normally use:

dist.seqs(fasta=current, cutoff=0.03, output=column)

Alternatively, I use output=lt

Is anything missing?

pschloss · September 19, 2019, 3:32pm

You don’t want the lt output format. The problem is likely data quality. If you aren’t able to get fully overlapping reads, then you’ll have to use the phylotype approach shown in the MiSeq SOP.

Pat

mafernandez · September 19, 2019, 4:16pm

Thanks, I will try that also.
However, the latest analyses I have run are a merge of 2 datasets (one with 26 samples and the other with 9) that worked OK separately. The problem appears when I put them together. Both datasets are bacterial 16S (same primers, same lab) from the same sampling points and from similar environments (soil and rocks). Could the number of samples be the problem?

About the ‘pre.cluster’ question, could this step be treated as OTU clustering at different similarity levels, as I proposed before?

Kendra · September 19, 2019, 8:10pm

Do you have 2x250 MiSeq sequences, or are they shorter? How much data per sample? Have you tried the OptiClust method?
