Procedure to avoid a large distance matrix

Hi again! After very useful help with my first question, I now have another. As I'm new to this kind of analysis and to mothur, hopefully somebody with more experience and knowledge can help me as well as last time…

The overall situation: I have several fasta files of quite different sizes (the biggest with 1,897,954 sequences) and have to analyze them (OTUs, Shannon, Chao…).
I had no problem with the smaller files, but with the bigger ones my distance matrix becomes far too big (>100 GB), and I read that this isn't good and that something is wrong -> high error rate.

My goal is to get as many sequences classified as possible. I will post my commands here; hopefully somebody can help me optimize them and finally solve this problem.

I use the current mothur version and the current Silva files from the mothur webpage.

Commands:
…, groups=abc)
screen.seqs(fasta=test1.fasta, group=abc.groups, maxambig=0, optimize=start-end, criteria=98, processors=x)
count.seqs(name=test1.good.names, group=abc.good.groups)
align.seqs(fasta=test1.good.unique.fasta, reference=silva.bacteria.fasta, flip=T)

screen.seqs(fasta=test1.good.unique.align, count=test1.good.count_table, optimize=start-end, criteria=98)
(If criteria=98 is not possible and my start > end, I set the range manually with the help of summary.seqs.)

filter.seqs(fasta=test1.good.unique.good.align, vertical=T, trump=.)
unique.seqs(fasta=test1.good.unique.good.filter.fasta, count=test1.good.good.count_table)
pre.cluster(fasta=test1.good.unique.good.filter.unique.fasta, count=test1.good.unique.good.filter.count_table, diffs=3)

chimera.uchime(fasta=test1.good.unique.good.filter.unique.precluster.fasta, count=test1.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

remove.seqs(fasta=test1.good.unique.good.filter.unique.precluster.fasta, accnos=test1.good.unique.good.filter.unique.precluster.uchime.accnos)

classify.seqs(fasta=test1.good.unique.good.filter.unique.precluster.pick.fasta, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=trainset14_032015.rdp.fasta, taxonomy=trainset14_032015.rdp.tax, cutoff=80)

dist.seqs(fasta=test1.good.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.11)

cluster(column=test1.good.unique.good.filter.unique.precluster.pick.dist, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, cutoff=0.20)

make.shared(list=current, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, label=0.03)

classify.otu(list=current, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=current, label=0.03)


mothur > dist.seqs(fasta=….precluster.pick.fasta, output=phylip, cutoff=0.11)

cluster(phylip=current, cutoff=0.20)

summary.single(list=current, label=unique-0.03-0.05-0.10)

It would be really helpful if you could check the commands and tell me whether they are totally wrong and whether I should use command x with setting yz instead. I assume that this many sequences can't be processed this way, and that I have to tighten my criteria in screen.seqs so that fewer sequences go into the later steps? Which criteria do you prefer, and how many sequences can the later steps handle? The other alternative would be a phylotype-based approach?
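For what it's worth, the phylotype-based alternative mentioned here skips the distance matrix entirely by binning sequences on their classification. A rough sketch, leaning on mothur's `current` keyword so no file names have to be guessed (it assumes classify.seqs has just been run as above):

```
phylotype(taxonomy=current)
make.shared(list=current, count=current, label=1)
classify.otu(list=current, count=current, taxonomy=current, label=1)
```

label=1 is the deepest (genus) level; the trade-off is that you get taxonomy-level bins instead of distance-based OTUs.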

In short:

  • Can you check whether the commands/the procedure are correct?
  • Can you give me a hint how to optimize or customize the procedure to prevent a >100 GB distance matrix?

If you need more information, please ask. Otherwise, big thanks for the help again!


dist.seqs(fasta=test1.good.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.11)

cluster(column=test1.good.unique.good.filter.unique.precluster.pick.dist, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, cutoff=0.20)

These two steps could be replaced with cluster.split, which should help a little with the memory issues. FWIW, both of these cutoff values should be 0.20.
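A sketch of that substitution, using the taxonomy produced by classify.seqs earlier to split the dataset before clustering (taxlevel=4, the order level, is a common choice; adjust as needed):

```
cluster.split(fasta=test1.good.unique.good.filter.unique.precluster.pick.fasta, count=test1.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.20)
make.shared(list=current, count=current, label=0.03)
```

Because clustering then happens within each taxonomic split, no single distance matrix has to cover every pair of sequences.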

Honestly, the best way to get around the gigantic distance matrix would be to generate better data by using the V2 chemistry to sequence the V4 region, as outlined in the blog post.


First, big thanks for your help!
I read about the cluster.split command too, but I thought it wouldn't solve the problem.
When I analyze one of my files the way I described, my initial 1,897,954 sequences produce a distance matrix ~500 GB large. Sure, a lot of sequences are eliminated along the way, but enough remain that the matrix still grows far beyond 100 GB.
In your blog post you mentioned that this is the result of sequences that don't overlap perfectly, which leads to
“• Inflated number of OTUs and diversity
• Increased distance between samples
• Increased difficulty in identifying chimeras”

cluster.split won't solve this problem, or am I wrong? I would get the matrix clustered, but it most probably would still have a high error rate!? So I couldn't really use my OTUs in the end, because I wouldn't know how reliable they are!?
So the only real solution would be to cut the file via screen.seqs and analyze only a part of the sequences rather than all of them together?
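If you do go that route, it would look roughly like this; the start/end values below are placeholders to be read off your own summary.seqs output, not recommendations:

```
summary.seqs(fasta=current, count=current)
screen.seqs(fasta=current, count=current, start=YOUR_START, end=YOUR_END)
```

Keeping only reads that fully span one region should also cut down the number of unique sequences going into dist.seqs.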

Thanks again for your help

cluster.split will help, but in your case it probably won’t solve the problem.

Again big thanks for your time and help.
The summary of my initial fasta file is this:

Start End Nbases Ambigs Polymer NumSeqs
Minimum: 1 300 300 0 3 1
2.5%-tile: 1 322 322 0 3 47449
25%-tile: 1 328 328 0 5 474489
Median: 1 446 446 0 5 948978
75%-tile: 1 462 462 0 5 1423466
97.5%-tile: 1 471 471 0 6 1850506
Maximum: 1 592 592 2 54 1897954
Mean: 1 411.29 411.29 6.74E-01 476.378

# of Seqs: 1897954

I worked with it the way I described before. In the end I got the large distance matrix, which then killed the run… Can you give me a hint how to work with this data to avoid the problem and get reliable results? Or do you know somebody who could help me?

All I can recommend is in