dist.seqs output too big


I am new to mothur and am currently following the MiSeq SOP. I am at the dist.seqs step of the OTU analysis, and the output file is already around 200 GB and still growing! I have 12 samples, 1.6 million total reads, and 354k unique reads. Is it advisable to continue? I estimate the final .dist output will be around 500-600 GB, and I am worried that a file that size will cause problems in the clustering step later. Thank you!!


Yeah, you may as well quit it now. What are you sequencing? If your reads do not fully overlap, you’re likely forced to use the phylotype-based approach because your error rate is so high. A high error rate inflates the number of unique sequences.
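To see why an inflated unique count matters so much here, a rough sketch of the arithmetic: the number of pairwise distances grows with the square of the number of unique sequences. The bytes-per-line figure below is an assumed placeholder for a column-formatted distance line, not a value taken from mothur, and a dist.seqs cutoff would keep the real file well below this worst case because distances above the cutoff are not saved.

```python
def pairwise_count(n_unique: int) -> int:
    """Number of unordered sequence pairs among n_unique sequences."""
    return n_unique * (n_unique - 1) // 2


def worst_case_bytes(n_unique: int, bytes_per_line: int = 40) -> int:
    """Upper bound on file size if every pair were written.

    bytes_per_line is an assumption (two sequence names plus a distance);
    the true value depends on the sequence name lengths.
    """
    return pairwise_count(n_unique) * bytes_per_line


if __name__ == "__main__":
    n = 354_000  # unique reads reported in the question
    print(f"{pairwise_count(n):,} pairs")
    print(f"worst case: {worst_case_bytes(n) / 1e12:.1f} TB")
    # Halving the unique count cuts the pair count roughly fourfold.
    print(f"ratio: {pairwise_count(n) / pairwise_count(n // 2):.2f}")
```

With 354k unique sequences this comes to roughly 6.3e10 pairs, so reducing the number of unique sequences has a far bigger effect than anything done downstream of the distance calculation.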

Thank you Pat, I didn’t realise I had a problem with high error rates until you mentioned it!

I’m sequencing a 16S rRNA fragment spanning the V5 to V8 regions; it is around 500-600 bp long. Yes, the reads do not fully overlap. I am trying out the phylotype-based approach now. Would it also be feasible to use the split.abund command to divide my sequences into abundant and rare groups, and then continue with the OTU-based approach using only the abundant sequences? The rationale is that most of the unique sequences with few representatives are likely due to sequencing error.

Thank you so much!
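One way to gauge how much an abundance split would shrink the problem is to tally the unique sequences at or below a cutoff directly from the count table. This is a minimal sketch, not mothur's own implementation: it assumes a tab-delimited count table whose first two columns are the sequence name and its total count, it treats a total at or below the cutoff as rare, and the sample table is made up for illustration. It says nothing about whether the remaining abundant sequences are free of error.

```python
import csv
import io


def split_by_abundance(count_table_text: str, cutoff: int = 1):
    """Return (abundant, rare) lists of sequence names.

    Assumes column 0 is the sequence name and column 1 its total count.
    A sequence is counted as rare when its total is <= cutoff.
    """
    # Drop blank lines and '#' comment lines before parsing.
    lines = [
        line for line in count_table_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.reader(io.StringIO("\n".join(lines)), delimiter="\t")
    next(reader)  # skip the header row
    abundant, rare = [], []
    for row in reader:
        name, total = row[0], int(row[1])
        (rare if total <= cutoff else abundant).append(name)
    return abundant, rare


# Hypothetical toy table, not data from this thread.
EXAMPLE = (
    "Representative_Sequence\ttotal\tsampleA\tsampleB\n"
    "seq1\t120\t70\t50\n"
    "seq2\t1\t1\t0\n"
    "seq3\t3\t0\t3\n"
    "seq4\t1\t0\t1\n"
)

if __name__ == "__main__":
    abundant, rare = split_by_abundance(EXAMPLE, cutoff=1)
    print(f"abundant: {len(abundant)}, rare: {len(rare)}")
```

Feeding the abundant count into a pairwise-count estimate gives a quick sense of whether the distance matrix would become manageable.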


Sorry, but I don’t think the OTU-based approaches will work for you.