Dist.seqs running for many days/large file

amwalkero0o · April 8, 2020, 4:28am

Hello,

I have been running dist.seqs on 179 samples for 3 days. The command looks like this:

dist.seqs(fasta=NPRB20.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.20, processors=1)

The .dist file is already 761 GB and the command is still running. Does this mean it is just running away with itself? I was reading on the wiki page that:

“If you know that you are not going to form OTUs with distances larger than 0.10, you can tell mothur to not save any distances larger than 0.10. This will significantly cut down on the amount of hard drive space required to store the matrix.”

However, I wasn’t sure how to “know” that I wasn’t going to be forming OTUs with distances larger than 0.10 so I used 0.20 as the cutoff. I’m not sure if this is correct, but I counted the number of lines in the .dist file for my test run (7 samples) that had a number greater than 0.10 and got 75,740,228. If this is what is meant by distances larger than 0.10 then it seems I should definitely be using 0.20. The command I used to count values greater than 0.10 is:

gawk -F"\t" 'NR>1 {if ($2>0.1) print $1;}' *.dist >> greater_0.10.txt
wc -l greater_0.10.txt
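
A minimal sketch of the same count in Python, for anyone who wants to double-check the tally. It assumes the .dist file is in mothur's column format with three whitespace-separated fields per line (sequence name, sequence name, distance) and no header row; if that holds, the distance is the third field rather than the second. The function name count_above and the sample records are made up for illustration.

```python
import io


def count_above(lines, threshold=0.10):
    """Count records whose distance exceeds `threshold`.

    Assumes three whitespace-separated fields per line:
    name, name, distance (an assumption about the file layout).
    """
    above = 0
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            dist = float(fields[2])
        except ValueError:
            continue  # third field is not a number, e.g. a header row
        total += 1
        if dist > threshold:
            above += 1
    return above, total


# Made-up records standing in for a real .dist file.
sample = io.StringIO(
    "seqA seqB 0.02\n"
    "seqA seqC 0.15\n"
    "seqB seqC 0.10\n"
)
above, total = count_above(sample)
print(f"{above} of {total} distances are greater than 0.10")
```

To run it on a real file, pass an open file handle in place of the sample. Reading line by line keeps memory use flat even for a very large matrix.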

I am loath to interrupt the command in case things are running as they should, but I am also unsure whether I should let it continue in case it goes on indefinitely and hits my storage quota. Any advice/input would be super appreciated!

TIA

Kendra · April 8, 2020, 5:18pm

Unless you have 761 GB of RAM, that will never complete. I recommend you use cutoff=0.03 and opticlust. If you don’t want to use opticlust, you will need to use cluster.split and taxlevel=3 or 4, depending on your community types.
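
A rough back-of-the-envelope sketch of why the matrix gets so large when few distances are discarded: the number of pairwise distances grows with the square of the number of unique sequences. The sequence counts and the 60 bytes per line below are assumptions chosen for illustration, not figures from this dataset, and the helper names are hypothetical.

```python
def pair_count(n_seqs):
    """Number of unordered sequence pairs: n * (n - 1) / 2."""
    return n_seqs * (n_seqs - 1) // 2


def estimate_bytes(n_seqs, bytes_per_line=60, kept_fraction=1.0):
    """Estimate the size of a column-format distance file.

    bytes_per_line: assumed average length of one "name name distance" line.
    kept_fraction: assumed share of pairs at or below the cutoff.
    """
    return int(pair_count(n_seqs) * bytes_per_line * kept_fraction)


# Hypothetical numbers of unique sequences.
for n in (10_000, 100_000, 200_000):
    gb = estimate_bytes(n) / 1e9
    print(f"{n:>7,} unique sequences: {pair_count(n):,} pairs, "
          f"roughly {gb:,.0f} GB if every pair is written")
```

A lower cutoff shrinks kept_fraction, which is the only term here that the cutoff parameter changes; the number of pairs that have to be compared stays the same.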

amwalkero0o · April 8, 2020, 5:48pm

I’m a bit confused by your reply, as opticlust and a cutoff of 0.03 refer to the cluster command, not the dist.seqs command.

Kendra · April 8, 2020, 7:23pm

Opticlust means that you don’t need to retain distances that are greater than 0.03 because sequences more than 0.03 different should never be put into the same OTU.

mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.03)

mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist, count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table)

amwalkero0o · April 8, 2020, 7:28pm

Thank you so much, I think I’m understanding better! So, if I use cutoff=0.03 in dist.seqs, I’m assuming I can still choose a cluster cutoff below 0.03, such as 0.02 or 0.01, since all distances up to 0.03 will be saved. Is that correct? I don’t know why, but I didn’t inherently connect the distance cutoff with the OTU cutoff. It would be great if the dist.seqs wiki connected the two.

pschloss · April 9, 2020, 3:17pm

cluster.split uses opticlust, and by default it uses 0.03 as the cutoff/OTU definition.

Pat

Kendra · April 9, 2020, 6:23pm

Pat, is there a reason to use cluster.split with opticlust rather than just cluster?

pschloss · April 16, 2020, 4:23pm

Yes, if you have a big dataset and need to first partition it via taxonomy. cluster.split allows you to parallelize the clustering.
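
To illustrate why partitioning helps, here is a small sketch comparing the number of sequence pairs in one large group with the total across several smaller groups. It assumes distances are only needed within each taxonomic group; the group sizes are invented and the function names are hypothetical.

```python
def pair_count(n_seqs):
    """Number of unordered sequence pairs: n * (n - 1) / 2."""
    return n_seqs * (n_seqs - 1) // 2


def partitioned_pairs(group_sizes):
    """Total pairs when sequences are only compared within their own group."""
    return sum(pair_count(n) for n in group_sizes)


# Hypothetical split of 100,000 unique sequences into four taxonomic groups.
group_sizes = [40_000, 30_000, 20_000, 10_000]

whole = pair_count(sum(group_sizes))
split = partitioned_pairs(group_sizes)

print(f"one group:    {whole:,} pairs")
print(f"four groups:  {split:,} pairs")
print(f"share of the unsplit work: {split / whole:.0%}")
```

Each group is also an independent job, which is what makes it possible to spread the work over several processors.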

