The .dist file is already 761 GB and the command is still running. Does this mean it is just running away with itself? I read on the wiki page that:
If you know that you are not going to form OTUs with distances larger than 0.10, you can tell mothur to not save any distances larger than 0.10. This will significantly cut down on the amount of hard drive space required to store the matrix.
However, I wasn’t sure how to “know” that I wouldn’t be forming OTUs with distances larger than 0.10, so I used 0.20 as the cutoff. I’m not sure if this is correct, but I counted the lines in the .dist file from my test run (7 samples) that had a distance greater than 0.10 and got 75,740,228. If this is what is meant by distances larger than 0.10, then it seems I should definitely be using 0.20. The command I used to count values greater than 0.10 is:
I am loath to interrupt the command in case things are running as they should, but I’m also unsure whether to let it continue, in case it runs indefinitely and hits my storage quota. Any advice/input would be super appreciated!
Unless you have 761 GB of RAM, that will never complete. I recommend you use cutoff=0.03 and opticlust. If you don’t want to use opticlust, you will need to use cluster.split and taxlevel=3 or 4, depending on your community types.
Opticlust means that you don’t need to retain distances that are greater than 0.03 because sequences more than 0.03 different should never be put into the same OTU.
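To make the effect of the cutoff concrete, here is a minimal Python sketch, not mothur's own code. It assumes a column-format distance list of (sequence, sequence, distance) rows; the sequence names and distances are invented toy values, and whether a distance exactly equal to the cutoff is kept is an assumption of this sketch. It shows how fast the number of pairwise distances grows with the number of sequences, and how many rows survive a 0.20 versus a 0.03 cutoff.

```python
def n_pairs(n_seqs: int) -> int:
    """Number of unordered sequence pairs, i.e. rows in a full pairwise list."""
    return n_seqs * (n_seqs - 1) // 2


def keep_below_cutoff(rows, cutoff):
    """Keep only (seq_a, seq_b, dist) rows whose distance is <= cutoff."""
    return [row for row in rows if row[2] <= cutoff]


# Toy distances, invented for illustration only.
toy_rows = [
    ("seqA", "seqB", 0.01),
    ("seqA", "seqC", 0.02),
    ("seqB", "seqC", 0.05),
    ("seqA", "seqD", 0.15),
    ("seqB", "seqD", 0.18),
    ("seqC", "seqD", 0.25),
]

print(n_pairs(100_000))                        # 4999950000
print(len(keep_below_cutoff(toy_rows, 0.20)))  # 5
print(len(keep_below_cutoff(toy_rows, 0.03)))  # 2
```

In this toy case the 0.20 cutoff drops one row out of six, while the 0.03 cutoff drops four.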
Thank you so much, I think I’m understanding better! So, if I use cutoff 0.03 in dist.seqs, I’m assuming I can choose a different cluster cutoff that is < 0.03, such as 0.02 or 0.01, since it will save all distances < 0.03. Is that correct? I don’t know why, but I didn’t inherently connect the distance cutoff with the OTU cutoff. It would be great if the dist.seqs wiki connected the two.
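The data side of this question can be checked with a small Python sketch. It uses hypothetical names and invented distances, and it says nothing about how mothur's clustering commands handle the cutoff internally, which this thread does not confirm. It only shows that a file saved with a 0.03 cutoff still contains every pair that a stricter 0.02 or 0.01 threshold would need.

```python
def pairs_within(rows, cutoff):
    """Return the set of sequence pairs whose distance is <= cutoff."""
    return {(a, b) for a, b, dist in rows if dist <= cutoff}


# Hypothetical rows that a 0.03 distance cutoff might have retained.
saved = [
    ("seqA", "seqB", 0.005),
    ("seqA", "seqC", 0.015),
    ("seqB", "seqC", 0.028),
]

for otu_cutoff in (0.01, 0.02, 0.03):
    # Each stricter threshold selects a subset of the saved pairs.
    print(otu_cutoff, sorted(pairs_within(saved, otu_cutoff)))
```

Going the other way does not work: a threshold above 0.03 would need distances that were never written to the file.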