I am having a little difficulty (I think) with classifying sequences. I have followed the SOP to the letter, and when I classify (regardless of whether I use RDP old or new, GreenGenes old or new, etc.; I’ve tried them all, even the RDP online classifier) about 30% of my sequences come out as unclassified bacteria. It is an Antarctic soil 16S 454 run. I’m used to a few unclassifieds with 454, simply because the repositories are not all-encompassing, but 30%? I feel like that’s a bit much.
Unclassified at what level? The root? Genus? Kirk is right that unclassifieds are common with short reads, but if they’re unclassified at the root or kingdom/domain level, then there might be a problem with the data.
They’re unclassified at the phylum level (so they get identified as Bacteria). They’re probably about 200 bp after following the SOP; is that normal? It’s Titanium 454. I reduced the cutoff to 60 and now get some reasonable (>80%) matches at lower taxonomic ranks that I didn’t get with the cutoff at 80, which identifies a few more of the unclassifieds.
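For anyone following along, here is a rough Python sketch of how a confidence cutoff trims a classification. It is not mothur's or RDP's code. It assumes taxonomy strings of the form `Name(score);Name(score);` and that a lineage is cut at the first rank whose score falls below the cutoff. The function names and the example lineage (`PhylumA`, `ClassB`, `GenusC`, and their scores) are made up for illustration.

```python
import re

# Assumed field format: "Name(score)", e.g. "Bacteria(100)".
_FIELD = re.compile(r"^(.*)\((\d+(?:\.\d+)?)\)$")


def parse_taxonomy(tax_string):
    """Split 'Name(score);Name(score);' into a list of (name, score) pairs."""
    pairs = []
    for field in tax_string.strip().strip(";").split(";"):
        match = _FIELD.match(field.strip())
        if match is None:
            raise ValueError(f"unexpected taxonomy field: {field!r}")
        pairs.append((match.group(1), float(match.group(2))))
    return pairs


def apply_cutoff(tax_string, cutoff):
    """Keep ranks from the root down until a score drops below the cutoff."""
    kept = []
    for name, score in parse_taxonomy(tax_string):
        if score < cutoff:
            break  # this rank and everything below it is reported as unclassified
        kept.append(name)
    return kept


# Hypothetical lineage with invented scores, for illustration only.
example = "Bacteria(100);PhylumA(72);ClassB(65);GenusC(61);"

print(apply_cutoff(example, 80))  # ['Bacteria']
print(apply_cutoff(example, 60))  # ['Bacteria', 'PhylumA', 'ClassB', 'GenusC']
```

Under these assumptions, a read that scores in the 60s and 70s below the domain level shows up as unclassified at phylum with a cutoff of 80 and gains lower ranks once the cutoff drops to 60.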
Tris, yeah, the length is normal (454 makes up numbers…). Can you try taking some of those sequences and BLASTing them against the nt database at NCBI? We actually came across this yesterday, and ours turned out to be mouse 18S. While yours are unlikely to be mouse, they could be some other artifact. Regardless, if you can align them, they’re probably “real” and it’s just a matter of figuring out what they are - something weird or something novel.
I’ve BLASTed a handful of the abundant ones and some of the rare ones, and they pretty much all match ‘uncultured bacterium’ clones, with some phylum representation wayyyyy down the list. So they seem like legitimate seqs. I tried classifying them with no cutoff and got some really good matches (bootstrap values of 90-100%) on previously unclassified ones. Does that seem right? I can’t rationalise that outcome. Surely, if the scores are that high, they would have come through with the 80% cutoff?
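One way to check this directly is to take the no-cutoff output and list the reads whose phylum-level score is already at or above 80, since those are the ones that would be expected to pass the 80% cutoff. Below is a minimal Python sketch, again assuming `Name(score);` taxonomy strings with the phylum as the second field. The helper names, sequence IDs, and lineages are invented for illustration.

```python
import re


def rank_scores(tax_string):
    """Return [(name, score), ...] from 'Name(score);' fields (assumed format)."""
    found = re.findall(r"([^;()]+)\((\d+(?:\.\d+)?)\)", tax_string)
    return [(name, float(score)) for name, score in found]


def should_have_passed(no_cutoff_tax, rank_index, cutoff):
    """True if the score at rank_index (0 = root) is at or above the cutoff."""
    scores = rank_scores(no_cutoff_tax)
    return len(scores) > rank_index and scores[rank_index][1] >= cutoff


# Hypothetical no-cutoff classifications keyed by sequence ID.
no_cutoff = {
    "seq1": "Bacteria(100);PhylumA(95);ClassB(92);",
    "seq2": "Bacteria(100);PhylumD(41);ClassE(30);",
}

PHYLUM = 1  # assumption: the second field is the phylum
flagged = sorted(
    seq_id
    for seq_id, tax in no_cutoff.items()
    if should_have_passed(tax, PHYLUM, 80)
)
print(flagged)  # ['seq1']
```

Any flagged ID that is also unclassified at phylum in the 80% run is worth comparing between the two output files by hand.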