Making OTUs without distance matrix

Hi there.
I am following the MiSeq SOP, with some variations in the pipeline to better suit my needs, and I keep hitting the same problem: my distance matrix is too big to fit into my computer's RAM, so cluster() and even cluster.split() fail.
Following the suggestions given here in response to my initial question:

I tried an approximation based on taxonomy instead of distance, running cluster.split with the ‘file=’ option. However, that is also taking a very long time…
Therefore, I decided to approximate my data without using OTUs, using only the taxonomic information from ‘classify.seqs’. My only problem would arise if I need to calculate a richness index, but I think I could solve that.

And here is my question: if you follow the MiSeq SOP you reach the ‘pre.cluster’ step, where you ‘merge’ sequences that differ by 1 nt per 100 bp. If, after doing this, you repeat the command but allow a difference of 3 nt per 100 bp, would that be equivalent to making OTUs at cutoff=0.03 using a distance matrix?
Would it be correct to treat these output sequences as OTUs for subsequent analyses?

Thanks

If you do 3/100 then you’re really making 6% OTUs since one sequence could be 3 off of two other sequences and those sequences would then be 6 off of each other.
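The point above can be checked with a toy example. This is only a sketch: the three 100 bp sequences are made up for illustration, and `hamming` is a hypothetical helper that counts mismatches between aligned sequences, not mothur's own distance calculation.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


# Made-up 100 bp sequence that the other two would be merged into.
center = "A" * 100
# Each neighbour differs from the center at 3 positions, but at different ones.
left = "CCC" + "A" * 97
right = "A" * 97 + "GGG"

d_left = hamming(center, left)    # 3 differences from the center
d_right = hamming(center, right)  # 3 differences from the center
d_across = hamming(left, right)   # differences between the two neighbours

print(d_left, d_right, d_across)  # prints: 3 3 6
```

Both neighbours are within 3 nt of the center, yet they are 6 nt apart from each other, so a group built this way can span up to twice the allowed number of differences.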

Are you using the latest version of mothur? We have put in some new algorithms that use a smaller distance matrix and use less RAM.

Pat

I see…

So, if I set pre.cluster to diffs=1 per 100 bp, is that like having 98% similarity OTUs? This would also work…

Yes, I am using mothur v1.42.3 on a Windows server with 16 processors and 64 GB of RAM, but my .dist files are over 70 GB, so I think this is the main problem.
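For a rough sense of how the file grows, here is a back-of-the-envelope sketch. Only the pair count n(n-1)/2 is a hard fact; the number of unique sequences, the fraction of pairs kept under the cutoff (`fraction_kept`) and the bytes per row (`bytes_per_row`) are assumed placeholder values, not measurements from my data.

```python
def pair_count(n_unique: int) -> int:
    """Number of distinct pairs among n_unique sequences: n * (n - 1) / 2."""
    return n_unique * (n_unique - 1) // 2


def estimated_dist_size_gb(n_unique: int, fraction_kept: float,
                           bytes_per_row: int = 40) -> float:
    """Rough size of a column-format distance file in GB.

    fraction_kept: assumed share of pairs with distance below the cutoff.
    bytes_per_row: assumed average row length (two names plus a distance).
    """
    rows = pair_count(n_unique) * fraction_kept
    return rows * bytes_per_row / 1e9


# Hypothetical numbers of unique sequences, for illustration only.
for n in (50_000, 100_000, 200_000):
    print(n, pair_count(n), round(estimated_dist_size_gb(n, 0.05), 1))
```

Doubling the number of unique sequences roughly quadruples the number of pairs, so the file size depends far more on how many unique sequences remain than on the number of samples as such.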

Thanks

Are you using cutoff=0.03 in dist.seqs?

This is the command line I normally use:

dist.seqs(fasta=current, cutoff=0.03, output=column)

Alternatively, I use output=lt

Is anything missing?

You don’t want the lt output format. The problem is likely data quality. If you aren’t able to get fully overlapping reads, then you’ll have to use the phylotype approach shown in the MiSeq SOP.

Pat

Thanks, I will try that also.
However, my latest analyses are a merge of 2 datasets (one with 26 samples and the other with 9) that worked fine separately. The problem appears when I put them together. Both datasets are bacterial 16S (same primers, same lab) from the same sampling points and from similar environments (soil and rocks). Could the number of samples be the problem?

About the ‘pre.cluster’ question, could this step be treated as OTU clustering at different similarity levels, as I proposed before?

Do you have 2x250 MiSeq sequences or are they shorter? How much data do you have per sample? Have you tried the OptiClust method?
