When I run MOTHUR on my samples I used the same SILVA database that is available in the SOP for the alignment. However, I get a lot of unclassified bugs so what I do is to classify them using BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome). Is there a way to do this in MOTHUR by using a database from NCBI and SILVA at the same time?
Yes, blast is one of the methods in classify.seqs. That being said, it’s a really horrible way to do classification since the database is minimally classified, only provides a local (not global) alignment, and doesn’t help you sort out ties when a sequence is similar to multiple references.
I’ve got a similar question. I’m working with bacterial communities from marine sediments amplified by V3-V4 MiSeq 2x300. I applied Mother pipeline and got my OTU table against SILVA. The most abundant OTU has been classified only at the Domain level (Bacteria), so I’m trying to classify those sequences against other reference databases to get a finer classification. Using RDP, I got the same taxonomic classification, i.e., classified only at the Domain level. However, when I used blast from NCBI, the sequences are classified to the species level. One of the first bacterial species assigned to these sequences (Desulfotalea psychrophila; sorted by maximum score) makes a lot of sense in these samples from marine sediments.
My question is, should I trust NCBI blast with regard to this classification, at least at the Class level? Below I’ve attached an screenshot from part of the output from NCBI blast.
Looking forward to finding an answer. Thanks in advance.
I would not trust NCBIs taxonomy or blast reports. You’ll note that the Ident column in the screenshot you sent only has 82% identity. This is pretty poor.
Many thanks for the reply. Is there any way to figure out whether these unclassified sequences are an artefact or are indeed a new bacterial taxa? As I said, I’ve tried with RDP with the very same results, that is, unclassified_Bacteria.
You may want to try the seqmatch tool at the RDP website, which lets you select the taxonomy. The “method=knn, search=kmer” approach of clasify.seqs is probably similar to seqmatch.
when you say a lot, what % are you talking about? Are they totally unclassified or classified to bacteria and nothing further? working in marine sediment, I wouldn’t be terribly surprised about 25% or more that only classify to bacteria and nothing further.
Thanks a lot for your suggestion. I’ve accessed seqmatch from RDP website (https://rdp.cme.msu.edu/seqmatch/seqmatch_sum.jsp?qvector=255&depth=0¤tRoot=0&num=20), but I cannot see the parameters you specified. I nevertheless proceeded and got the output that you can see below. It seems those reads are assigned to Firmicutes phylum with a high identity % (up to 0.93%). Am I interpreting the output correctly?
This unclassified (unclassified_Bacteria) OTU represents up to 26% of my bacterial communities at the class level. So your observation is highly accurate . As I mentioned, Mothur classified those sequences only at the Domain level (i.e., Bacteria). Since such OTU represents a remarkably high proportion of the communities, I’d like to figure out how to classify these sequences as far as possible, at least at phylum level.
I just want to gently put out there that just because one algorithm gives you a deeper classification, that doesn’t make it correct. I would stick with the RDP naive Bayesian approach that their website and mothur’s classify.seqs defaults to. There are a lot of problems with a knn approach.
Another thing to comment on is that you’re using 300PE chemistry to sequence the V3-V4 region. The 300PE chemistry is pretty awful for the last 100 bp. I worry that you might have a data quality issue. See this blogpost for more information.
Pat, I’m testing out the 600bp chem again. Illumina insists they’ve fixed the issues. I’ll let you know how that looks in the next few weeks.
Juanjo-I’m with Pat. attaching a name isn’t worth much, esp if the named sequence is distant from your sequence. Embrace that the world of sediment microbiology is largely unknown!
Can you see why similarity isn’t being calculated? I don’t think s_ab score should be interpreted as equal to similarity.
3. A similarity score. SeqMatch reports the percent sequence identity over all pairwise comparable positions when run with aligned myRDP sequences. (Comparable positions are aligned positions containing a base in both sequences). Note that the rank order may differ between S_ab and pairwise identity scores, but the top 20 S_ab scores will contain the closest sequence by pairwise identity about 95% of the time (Cole et al). If two sequences do not overlap, the similarity between these two sequences will be displayed as “?”.
4. A seqmatch score (S_ab). These are the number of (unique) 7-base oligomers shared between your sequence and a given RDP sequence divided by the lowest number of unique oligos in either of the two sequences.