Unclassified reads in OTU data

Hi everyone,

I am completing an analysis for some Ocular microbiome stuff in which we targeted the V4 region. When running the QC (classify.seqs) I am able to decipher the origin of the unclassified reads from stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary by using the rank IDs so for example instead of saying a read is unclassifed I can describe it as unclassified, Betaproteobacteria. My question is, for my final analysis of the data I run all of the sequencing reads through the whole pipeline (through cluster.split) to get genus level OTU data and use this to make relative abundance graphs and a corresponding Bray-Curtis dendrogram, how can I get the same sort of information for the unclassifed reads? I can’t seem to locate where this info is. This is a problem for me because when I run the QC and determine a threshold at which to identify “Ocutypes” (samples dominated by distinct bacterial loads) I have 3 additional groups for various unclassifed reads but in my OTU data I only have 1 unclassified ocutype, so these reads are being all grouped together which is not truly representative. Thanks for taking the time to read this and for you expertise.

You’re using classification to cluster the samples? Why not use OTUs? The problem of all unclassified going together is exactly the problem that OTUs were created to overcome.

I’m not sure you read my entire question. I only run samples through classify.seqs for quality control purposes during an active study. Once we have sequenced all of the samples for a particular study all of the samples are run through the entire Mothur pipeline (all the way to cluster.split) to generate genus level OTUs. From this data how can I discern the origin of the unclassified reads?

You are correct, I don’t understand why you aren’t clustering this preliminary data.

So we don’t process samples all the way to cluster.split because of the computational time and resources needed. Also, I am just following suit of the person in charge. For example, we get 10 samples on a sequencing run and the project will evenutally have 100 samples. In the interim, we report the taxonomic classifications to our users using classify.seqs. So essentially, we can say sample 1 is dominated by Lacto, sample 2 is Coryne & Strep so on and so forth.

When I get all 100 samples, I create a stability.files that includes all of the samples and I launch my script that will follow the entire MiSeq Mothur SOP resulting in OTU data. Then I use the R package PhyloSeq to ascribe taxonomy to the OTU so once again I have samples labeled with Genus-level information and I can report the same sort of relative abundance graphs showing the bacterial compostion in each sample.

The problem for me is I am being asked to provide the family level information for the reads that are desingated as “unclassified”. In the reads that are not OTU level this was easy to find as you could follow the rank ID from the stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary file.

Is this approach reasonable, maybe I am missing something more theoretical? My Director would like to be able to report unclassifed reads more descriptively, for example “unclassified, Firmicutes”. He also believes these reads may skew my Bray-Curtis dendrogram if I don’t ascribe family level data to these reads. Any help would be greatly appreciate, I’m going crazy trying to figure out what to do.

ok if you’re literally talking about 100 samples total, I would fully cluster as each batch is sequenced. That’s not going to take that much computational resources. If you are following the SOP, run each batch through chimera checking and save those files. When you get a new batch run it through chimera checking, concatenate all the files together, realign all of them together and proceed with the SOP.

If your director still insists that you shouldn’t cluster as each batch comes off the machine :roll: then you can do what you are suggesting by concatenating the phyla and genus fields

phyla.genus <- paste(phyla, genus, sep=".")