eye rolling "unique" question


I know this has probably been referenced many times, and I tried to read through the FAQ, but it still made no sense to me.

Unique means each sequence in an OTU is identical, correct? So why is it that when I cluster and it gets to the unique level, I still end up with fewer sequences than in my unique.filter.pick fasta, and some sequences still get clustered into OTUs?

I tried looking for the answer on the forums, though admittedly not very hard, and came up short.

thanks as always.

Because it looks like you ran unique.seqs before filter.seqs. After filter.seqs, run unique.seqs (and pre.cluster) and you should be good to go.
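To see why the order matters, here is a toy Python sketch (not mothur code; the sequences, the kept columns, and the helper names `filter_columns` and `unique` are all made up for illustration). Sequences that are distinct before filtering can become identical once alignment columns are removed, which is why dereplicating again after filter.seqs gives a smaller set.

```python
# Toy illustration only: the sequences and the column filter are invented,
# and this does not reproduce mothur's filter.seqs or unique.seqs.

def filter_columns(aligned, keep):
    """Keep only the alignment columns whose indices are listed in `keep`."""
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}

def unique(seqs):
    """Group sequence names by identical sequence string."""
    groups = {}
    for name, seq in seqs.items():
        groups.setdefault(seq, []).append(name)
    return groups

aligned = {
    "seqA": "ACGT-A",
    "seqB": "ACGT-C",  # differs from seqA only in the last column
    "seqC": "ACGA-A",
}

# Assume the filter drops the all-gap column and the last column.
keep = [0, 1, 2, 3]

before = unique(aligned)                       # every string is distinct
after = unique(filter_columns(aligned, keep))  # seqA and seqB now match

print(len(before), len(after))  # prints "3 2"
```

Running the dereplication step on the filtered sequences is what collapses seqA and seqB into a single unique sequence.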

That makes perfect sense, thanks so much.

I have a related question about the cluster() command’s output. I’m seeing different “unique” OTU counts in my *.rabund files depending on the method options used.

My “method=furthest” and “=average” outputs indicate 6,391 unique-distance OTUs, but “=nearest” indicates only 3,426 unique-distance OTUs.

I thought the clustering method should only impact the OTU binning at higher thresholds (0.01, 0.02, etc.) because the “unique” bins only contained identical sequences?

Could this be a calculation error, or have I misunderstood something?

I’ve read about the distinction between “unique” and “0.00” (and the fact that 0.0049 would be rounded down to 0.00), but I’m not seeing any “0.00” level OTU counts in my *.rabund files anyway.

Background info:
I’m basically following the 454 SOP, but I have ITS data that can’t be aligned (as far as I know), so I’ve skipped the align.seqs() and pre.cluster() steps.
I used the unique.seqs() command (count= 6,893).
I used the pairwise.seqs() command as an alternative to dist.seqs().

I’m also curious about the discrepancy between my unique.seqs count (6893) and the unique OTU counts (6391) after running cluster(), but I’m assuming this is possible because I calculated distances with “countends=F”. In this case, I think two sequences of differing length might be counted as distinct via the unique.seqs() command and yet produce a distance of 0.000 across their aligned space.
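To make that guess concrete, here is a toy Python sketch. It is not mothur's implementation: the three sequences, the `count_ends` parameter, and the simple agglomerative loop are assumptions made for illustration. It shows two things: sequences that are distinct as strings can have a distance of 0 when terminal gaps are ignored, and those zero distances need not be transitive, in which case nearest-neighbor linkage can merge more sequences at distance 0 than furthest or average linkage. Whether this accounts for the counts above is not confirmed anywhere in this thread.

```python
# Toy sketch under stated assumptions; not mothur's distance or cluster code.

def distance(a, b, count_ends=False):
    """Fraction of mismatched columns between two equal-length aligned strings.
    With count_ends=False, columns inside either sequence's leading or
    trailing gaps are skipped."""
    assert len(a) == len(b)
    start, end = 0, len(a)
    if not count_ends:
        start = max(len(s) - len(s.lstrip("-")) for s in (a, b))
        end = min(len(s.rstrip("-")) for s in (a, b))
    if end <= start:
        return 1.0  # no overlapping columns to compare
    mismatches = sum(a[i] != b[i] for i in range(start, end))
    return mismatches / (end - start)

def cluster_at(names, dist, cutoff, linkage):
    """Merge clusters while the linkage distance of the best pair is <= cutoff."""
    clusters = [[n] for n in names]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[frozenset((p, q))]
                         for p in clusters[i] for q in clusters[j]]
                d = linkage(pairs)
                if d <= cutoff and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)

seqs = {"x": "AACGT", "y": "-ACGT", "z": "TACGT"}  # y is one base shorter
names = sorted(seqs)
dist = {frozenset((p, q)): distance(seqs[p], seqs[q])
        for p in names for q in names if p < q}

n_unique = len({s.replace("-", "") for s in seqs.values()})
nearest = cluster_at(names, dist, 0.0, min)
furthest = cluster_at(names, dist, 0.0, max)
average = cluster_at(names, dist, 0.0, lambda d: sum(d) / len(d))

print(n_unique, len(nearest), len(furthest), len(average))
```

In this toy case x and y are at distance 0, y and z are at distance 0, but x and z are not. Nearest-neighbor linkage chains all three into one group at distance 0, while furthest and average linkage keep z separate.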

Thanks for any info you can provide.

A couple of years ago we switched to a hard cutoff, so 0.00 = unique and 0.0049 would possibly be in the 0.01 cluster (assuming there was nothing between 0.0050 and 0.01). So I’m not sure why the different methods would give you different results, unless you’re using a very old version of mothur. When you look at the top of your mothur session, what version does it say you’re using?
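As a rough illustration of the difference between rounding and a hard cutoff, here is a small Python sketch. It is one reading of the description above, not mothur's actual code; the function names `rounded_label` and `hard_cutoff_label` and the 0.01 step are assumptions.

```python
# Sketch only: one interpretation of "rounding" versus "hard cutoff" labels.
from decimal import Decimal, ROUND_HALF_UP, ROUND_CEILING

STEP = Decimal("0.01")  # assumed label precision

def rounded_label(d):
    """Label a distance by rounding to two decimals (0.0049 -> 0.00)."""
    return Decimal(d).quantize(STEP, rounding=ROUND_HALF_UP)

def hard_cutoff_label(d):
    """Label a distance by the first two-decimal level that is >= d
    (0.0049 -> 0.01), so only exact zeros land at 0.00."""
    return Decimal(d).quantize(STEP, rounding=ROUND_CEILING)

for d in ["0.0000", "0.0049", "0.0050", "0.0100"]:
    print(d, rounded_label(d), hard_cutoff_label(d))
```

Under this reading, only a distance of exactly 0 stays at the 0.00 level with a hard cutoff, whereas rounding would also pull 0.0049 down to 0.00.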


Thanks for the reply.

I was using a slightly outdated version.

However, I have just repeated the cluster(method=nearest) and cluster(method=average) runs with the newest Windows version (v.1.33.3, 64-bit), and I’m seeing the exact same results.

In any case, I’m happy with the method=average OTU data, so this is mostly an academic (or bug reporting) question.

I’d be happy to upload the distance table if that’s helpful, but I think it’s ~800 MB. (I ran pairwise.seqs() with “cutoff=1.0” because my first attempt with “cutoff=0.10” ran into the known issue where cluster(method=average) provides only unique-level OTUs.)