I decide to use Silva taxonomy for classify.seqs step. So, I made unaligned fasta file from silva.nr_v123.align simply by removing ‘.’ and ‘-‘. We do not need aligned fasta file for classify.seqs, right?
Could you please comment on my reasons below?
In the whole data process, now I use only one taxonomy (Silva) instead of two (Silva and PDS). I guess this may the results more self-consistent.
More important, there are more sequences in Silva (172418) than in PDS (10801). I am afraid that using PDS may results more reads being wrongly classified as ‘unknown’ than using Silva. I am debating whether I should throw away reads classified as ‘unknown’. Could you please comment on this?
Thanks a lot.
With regards to alignment - the aligner ignores gaps in the sequences so there’s no harm in removing the alignment gaps before classifying. The only reason you may want those is if you want to filter your reference database to only account for the region of your amplicon. There was a paper published a few years ago that showed doing this improves accuracy of assignment and reduces the amount of unclassified sequences in your data set.
For SILVA vs RDP (vs Greengenes) there’s no right answer as to which taxonomy to use. Ultimately they both pull data from the same sources but they use different approaches to how they assign taxonomy and what they include in their database. The rule of thumb I use is that SILVA is more open to candidate phyla than RDP. Last time I checked there were about 80 phyla/candidate phyla in the SILVA taxonomy and about 40 in RDP.
My view is you should definitely throw away anything that can’t classify to the domain/phylum level.