Dist.seqs running for many days/large file

Hello,

I have been running dist.seqs on 179 samples for 3 days. The command looks like this:

dist.seqs(fasta=NPRB20.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.20, processors=1)

The .dist file is already 761 GB and the command is still running. Does this indicate that it is just running away with itself? I read on the wiki page that:

If you know that you are not going to form OTUs with distances larger than 0.10, you can tell mothur to not save any distances larger than 0.10. This will significantly cut down on the amount of hard drive space required to store the matrix.

However, I wasn’t sure how to “know” that I wasn’t going to be forming OTUs with distances larger than 0.10 so I used 0.20 as the cutoff. I’m not sure if this is correct, but I counted the number of lines in the .dist file for my test run (7 samples) that had a number greater than 0.10 and got 75,740,228. If this is what is meant by distances larger than 0.10 then it seems I should definitely be using 0.20. The command I used to count values greater than 0.10 is:

gawk -F"\t" 'NR>1 {if ($2>0.1) print $1;}' *.dist >> greater_0.10.txt
wc -l greater_0.10.txt
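A hedged aside on the count above: mothur's column-format distance files are usually described as three fields per row (two sequence names, then the distance). Under that assumption, the distance is the third field rather than the second. The sketch below is not a mothur tool; the function name and sample rows are made up for illustration.

```python
def count_distances_above(lines, threshold=0.10):
    """Count rows whose third whitespace-separated field exceeds threshold.

    Assumes each row looks like: seqA seqB distance
    """
    count = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        try:
            distance = float(fields[2])
        except ValueError:
            continue  # skip any header-like row
        if distance > threshold:
            count += 1
    return count


# Hypothetical rows standing in for a real .dist file.
sample = ["seqA seqB 0.02", "seqA seqC 0.15", "seqB seqC 0.10"]
print(count_distances_above(sample))
```

For a real file, the same function could be fed an open file handle instead of a list, so the file is read one row at a time.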

I am loath to interrupt the command in case things are running as they should, but I am also unsure whether I should let it continue, since it might run indefinitely and hit my storage quota. Any advice/input would be super appreciated!

TIA

Unless you have 761 GB of RAM, that will never complete. I recommend you use cutoff=0.03 and opticlust. If you don’t want to use opticlust, you will need to use cluster.split with taxlevel=3 or 4, depending on your community types.
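To see why an uncapped or loosely capped distance file grows so quickly, here is a rough back-of-the-envelope sketch. The number of pairs among n unique sequences is n(n-1)/2; the sequence counts and the 30 bytes per row are assumptions chosen only to show the trend, not figures from this dataset.

```python
def pair_count(n_sequences):
    """Number of unordered pairs among n unique sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def estimated_size_gb(n_sequences, bytes_per_row=30):
    """Rough size in GB if every pair were written as one text row.

    bytes_per_row is an assumed average (two names plus a distance).
    """
    return pair_count(n_sequences) * bytes_per_row / 1e9


# Hypothetical numbers of unique sequences, chosen only to show the trend.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} sequences -> {pair_count(n):,} pairs, "
          f"~{estimated_size_gb(n):,.1f} GB")
```

A lower cutoff does not change the number of pairs that are compared, only how many of them are written out, which is where the disk and memory savings come from.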

I’m a bit confused by your reply, as opticlust and a cutoff of 0.03 refer to the cluster command, not the dist.seqs command.

Opticlust means that you don’t need to retain distances that are greater than 0.03 because sequences more than 0.03 different should never be put into the same OTU.

mothur > dist.seqs(fasta=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta, cutoff=0.03)

mothur > cluster(column=stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.pick.dist,count=stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.pick.pick.count_table)

Thank you so much, I think I’m understanding better! So, if I use cutoff 0.03 in dist.seqs, I’m assuming I can choose a different cluster cutoff that is < 0.03, such as 0.02 or 0.01, as it will save all distances < 0.03. Is that correct? I don’t know why, but I didn’t inherently connect the distance cutoff with the OTU cutoff. It would be great if the dist.seqs wiki connected the two :)
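The set logic behind that question can be sketched as follows. This only shows that any distance at or below 0.01 is also at or below 0.03, so rows saved at the larger cutoff still contain every row needed at the smaller one. It says nothing about how mothur itself clusters, and treating the cutoff as inclusive is an assumption. All names and values are hypothetical.

```python
def keep_within(distances, cutoff):
    """Return the (seq_a, seq_b, distance) rows with distance <= cutoff."""
    return [row for row in distances if row[2] <= cutoff]


# Hypothetical pairwise distances, for illustration only.
all_distances = [
    ("seqA", "seqB", 0.005),
    ("seqA", "seqC", 0.020),
    ("seqB", "seqC", 0.028),
    ("seqA", "seqD", 0.150),
]

saved = keep_within(all_distances, 0.03)      # what a 0.03 cutoff would retain
from_saved = keep_within(saved, 0.01)         # thresholding the saved rows
from_full = keep_within(all_distances, 0.01)  # thresholding the full matrix

print(from_saved == from_full)
```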

cluster.split uses opticlust, and if you use it, it will use 0.03 as the cutoff/OTU definition by default.

Pat

Pat, is there a reason to use cluster.split with opticlust rather than just cluster?

Yes, if you have a big dataset and need to first partition it via taxonomy. cluster.split also allows you to parallelize the clustering.
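The idea of partitioning by taxonomy before clustering can be illustrated with a small sketch. This is not how cluster.split is implemented; it only shows that grouping sequences by their first few taxonomic ranks leaves far fewer within-group pairs to compare, and that each group could then be handled independently. The taxonomy strings, sequence names, and function names are hypothetical.

```python
from collections import defaultdict


def partition_by_taxlevel(taxonomy, level):
    """Group sequence names by the first `level` ranks of their lineage."""
    groups = defaultdict(list)
    for name, lineage in taxonomy.items():
        ranks = [rank for rank in lineage.split(";") if rank]
        groups[";".join(ranks[:level])].append(name)
    return dict(groups)


def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


# Hypothetical taxonomy assignments, for illustration only.
taxonomy = {
    "seq1": "Bacteria;Firmicutes;Bacilli;Lactobacillales;",
    "seq2": "Bacteria;Firmicutes;Bacilli;Lactobacillales;",
    "seq3": "Bacteria;Firmicutes;Clostridia;Clostridiales;",
    "seq4": "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;",
    "seq5": "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;",
    "seq6": "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;",
}

groups = partition_by_taxlevel(taxonomy, level=3)
within = sum(pair_count(len(members)) for members in groups.values())

# Number of groups, pairs within groups, pairs without any partitioning.
print(len(groups), within, pair_count(len(taxonomy)))
```

In this toy case the six sequences fall into three groups with 4 within-group pairs, against 15 pairs when nothing is partitioned.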
