Hi! I was hoping to get some help with the issue below. We use the following commands to obtain the representative sequences (without gaps) for each OTU, along with their taxonomy:
get.oturep →
degap.seqs →
classify.seqs
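In sketch form, assuming placeholder file names and the RDP "pds" training set as reference (your names will differ; method=abundance is just one way to pick a representative without supplying a distance matrix), this is roughly:

# placeholder file names throughout; output names depend on your inputs
get.oturep(list=final.opti_mcc.list, count=final.count_table, fasta=final.fasta, label=0.03, method=abundance)
degap.seqs(fasta=final.opti_mcc.0.03.rep.fasta)
classify.seqs(fasta=final.opti_mcc.0.03.rep.ng.fasta, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax)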
We then use the output files to manually curate the most important OTUs (i.e., we take each representative sequence, BLAST it, and compare the result to the assigned taxonomy).
This works fine in mothur v1.39.5. In mothur v1.48.0, however, both degap.seqs and classify.seqs produce output that is no longer in the original OTU order. The …rep.fasta file contains the representative sequences in the right order (OTU1, OTU2, OTU3, …), but this order is lost after degapping and/or classifying the sequences. This seems to happen for both commands independently:
get.oturep →
degap.seqs → the resulting …rep.ng.fasta file is no longer in OTU order
classify.seqs → the resulting …rep.ng.pds.wang.taxonomy file is no longer in OTU order (this also happens when the input is the original …rep.fasta file rather than the …rep.ng.fasta file)
This is a problem for our manual curation step, where we need to know which OTU each representative sequence belongs to. Would it be possible to look into this and update the commands so that they retain the original OTU order?
An additional request: would it be possible for the degap.seqs command to produce output that retains the OTU numbers? Right now the …rep.ng.fasta file gives only the sequence names and their sequences, whereas the original …rep.fasta also contained the OTU numbers. It would be very useful if this information were retained, so that we could more easily merge/link output files.
Can you clarify what order they are in when they are changed? Or do you mean the sequence names are changing? If it is purely a matter of the order, then we’ll probably need you to handle this on your own. For what it’s worth, you don’t need to degap the sequences prior to classification; classify.seqs will do that for you.
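So a sketch like this (placeholder file names again) should work directly on the get.oturep output:

# classify.seqs removes the gaps internally before classifying
classify.seqs(fasta=final.opti_mcc.0.03.rep.fasta, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax)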
Aside from that…
I would strongly advise against the strategy you are taking. We have a command, classify.otu, that returns the consensus taxonomy for an OTU. This method is outlined in the MiSeq SOP. Your approach of finding the representative sequence for an OTU and then classifying it is prone to getting lucky or unlucky depending on which sequence is picked relative to the others in the OTU. You might want to look at Figure 2 from https://journals.asm.org/doi/10.1128/aem.02810-10. Aside from that figure, I don’t think I’ve ever used get.oturep in an analysis of mine.
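In sketch form, with placeholder file names, the SOP-style route is:

# classify every sequence, then take the consensus taxonomy per OTU
classify.seqs(fasta=final.fasta, count=final.count_table, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax)
classify.otu(list=final.opti_mcc.list, count=final.count_table, taxonomy=final.pds.wang.taxonomy, label=0.03)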
I wouldn’t put much stock in using BLAST as a check on the naive Bayesian approach. There are a lot of problems with using BLAST for classification. Here are a few…
1. It performs a local alignment of your sequence rather than a global alignment. This will make the sequences look more similar to each other than they really are.
2. You are dependent on the taxonomy of the sequences in NCBI, which is generally researcher-provided and often incorrect. The RDP, greengenes, and SILVA taxonomies, in contrast, are curated to fit a taxonomic outline.
3. The taxonomy you’ll derive from BLAST is typically the top match or perhaps the top few matches. These tend to overstate the classification of your sequence (you’ll see species and strain names). The naive Bayesian approach effectively creates a model for each genus and assesses the likelihood that your sequence came from each genus.
Thanks very much for your response. While producing a small example to better illustrate the issue, I realised what the culprit might be. Both the degap.seqs and classify.seqs commands use multiple processors unless specified otherwise (since mothur version 1.40.0), and I think they then return the degapped sequences and taxonomies in the order they are processed, instead of in the order of the input fasta file. After setting processors=1, the issue seems to be resolved.
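In case it helps anyone else, this is the sketch (placeholder file names) that works for us now:

# processors=1 preserves the input order of the sequences
degap.seqs(fasta=final.opti_mcc.0.03.rep.fasta, processors=1)
classify.seqs(fasta=final.opti_mcc.0.03.rep.ng.fasta, reference=trainset18_062020.pds.fasta, taxonomy=trainset18_062020.pds.tax, processors=1)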
Regarding your feedback on the strategy we are taking, we do not disagree with any of your points here.
1. We do use classify.otu (with the RDP reference database) to obtain the consensus taxonomy for each OTU. But given that no single reference database is perfect, we think it’s a good idea to cross-validate the classifications. One way is to run classify.otu with another database (sketched below this list), but we usually also manually check some of the most important OTUs (very abundant ones, those that show statistically significant differences between groups, or those where the consensus is below 100%) with BLAST.
2. The NCBI database is definitely imperfect, as you say, and we never rely on just the top few BLAST hits. As a general rule, a hit has to have 100% query coverage and at least 99% identity before we’d take a closer look at it.
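For concreteness, the cross-validation against a second database looks roughly like this (placeholder file names; silva.nr_v138_1 stands in for whichever SILVA release is in use):

# reclassify against SILVA, then recompute the per-OTU consensus
classify.seqs(fasta=final.fasta, count=final.count_table, reference=silva.nr_v138_1.align, taxonomy=silva.nr_v138_1.tax)
classify.otu(list=final.opti_mcc.list, count=final.count_table, taxonomy=final.nr_v138_1.wang.taxonomy, label=0.03)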