Here we are with another question.
We performed the pre.cluster command with diffs=4 (V3-V4 region). Subsequently we ended up with the *list.list file with the different labels.
The MiSeq-SOP recommends the label 0.03. But why? Does it correlate to a certain diff-value? Does the choice of the diff-value influence the selection of a certain label? And if so, why?
In the following data analysis we would tend to use the label “0.02”. Our thinking is that this label consists of more “genera” than the “0.03” label.
Thanks for your help and patience with us.
You are using the OTU-based approach, which groups sequences by distance. The label (e.g. 0.03) is the distance at which the clustering was performed. With a label of 0.03, sequences that were more than 97% similar were clustered together into one OTU. As you can see from your table, as you increase the clustering distance (a larger label), the total number of OTUs at that distance (numOtus) decreases, because each OTU now includes more sequences.
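To make the label-to-similarity relationship concrete, here is a minimal Python sketch. It is not part of mothur; the function name is made up for illustration, and it only does the arithmetic described above (similarity = 1 - distance).

```python
def label_to_similarity_percent(label: float) -> float:
    """Convert a distance label (e.g. 0.03) to percent similarity.

    Hypothetical helper for illustration only; mothur does not provide it.
    """
    if not 0.0 <= label <= 1.0:
        raise ValueError("label must be a distance between 0 and 1")
    # Similarity is the complement of distance; round to hide float noise.
    return round((1.0 - label) * 100.0, 2)


if __name__ == "__main__":
    for label in (0.02, 0.03, 0.05):
        percent = label_to_similarity_percent(label)
        print(f"label {label}: sequences at least ~{percent}% similar share an OTU")
```

A smaller label means a stricter similarity cutoff, which is why the 0.02 row of the table has more (and smaller) OTUs than the 0.03 row.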
With 16S data, a distance of 0.03 is generally accepted as a surrogate for species level, and 0.05 for genus level. Your 0.02 label doesn’t consist of more genera; it contains more OTUs, with each OTU representing sequences that are less than 2% different.
The diffs value in pre.cluster is a related but separate setting, and it is explained nicely here. My understanding is that diffs / amplicon-length should be less than the label you want to use to generate OTUs, so the distance at which you wish to cluster OTUs sets an upper limit on diffs in pre.cluster. Someone can correct me if I am wrong about that.
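As a rough check of that rule of thumb, the arithmetic can be sketched in Python. This is only my reading of the relationship, not anything mothur computes for you. The function names are hypothetical, and the amplicon length of 420 bp is an assumed placeholder for a V3-V4 fragment; substitute the actual length of your aligned, trimmed sequences.

```python
def precluster_fraction(diffs: int, amplicon_length: int) -> float:
    """Fraction of positions pre.cluster may merge across: diffs / length."""
    if amplicon_length <= 0:
        raise ValueError("amplicon_length must be positive")
    return diffs / amplicon_length


def diffs_within_label(diffs: int, amplicon_length: int, label: float) -> bool:
    """True if diffs / amplicon_length stays below the OTU clustering label."""
    return precluster_fraction(diffs, amplicon_length) < label


if __name__ == "__main__":
    DIFFS = 4                # value used in the question
    ASSUMED_LENGTH_BP = 420  # ASSUMPTION: placeholder length, not from the thread
    for label in (0.02, 0.03):
        fraction = precluster_fraction(DIFFS, ASSUMED_LENGTH_BP)
        ok = diffs_within_label(DIFFS, ASSUMED_LENGTH_BP, label)
        print(f"diffs/length = {fraction:.4f}, label = {label}: below label? {ok}")
```

Under that assumed length, diffs=4 corresponds to just under 1% of the amplicon, so it would sit below both the 0.02 and the 0.03 label. With a much shorter fragment the same diffs value could exceed the label, which is the case the rule of thumb is meant to catch.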
Thank you very much for your help and commenting on our question!