I am currently processing amplicon sequencing data (Illumina MiSeq) from piezophilic organisms cultivated with long-chain alkanes under very high pressure (up to 200 bar). The original inoculum came from a deep-sea core sample, so it seems likely to me that not all 16S sequences can be classified, even using the RDP trainset 10 (pds version).
I am inclined not to remove these “unknowns”. I don’t want to include them in any elaborate analysis, but I would like to see whether they relate to my design, i.e. whether more “unknown” 16S sequences occur at higher pressures (incubations were done at a series of increasing pressures).
So my question is how cluster.split handles “unknown” sequences, and whether it would be better in this case to use the regular cluster command.
If I am making a major mistake by keeping the unknowns when I should have removed them, please let me know. I am aware that unclassified reads are not necessarily a proxy for unknown diversity and could also be sequencing errors. But if their abundance correlates systematically with the experimental design, that might indicate that we are seeing previously undescribed organisms as we increase the pressure, no?
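To make the comparison I have in mind concrete, here is a minimal sketch in plain Python (not mothur itself). It assumes I already have, for each incubation, the number of unclassified reads and the total number of reads; the function names are my own, and the numbers in the example at the bottom are placeholders, not my data. It computes the fraction of unclassified reads per incubation and a Spearman rank correlation against pressure.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def unclassified_fractions(unclassified_reads, total_reads):
    """Per-sample fraction of reads that could not be classified."""
    return [u / t for u, t in zip(unclassified_reads, total_reads)]


if __name__ == "__main__":
    # Placeholder values for illustration only (hypothetical, not real results).
    pressures_bar = [1, 50, 100, 200]
    unclassified = [10, 20, 40, 80]
    totals = [1000, 1000, 1000, 1000]
    fractions = unclassified_fractions(unclassified, totals)
    print(spearman(pressures_bar, fractions))
```

With only a handful of pressure levels, a rank correlation like this would be a rough indication at best, which is partly why I am asking whether keeping the unknowns is sensible in the first place.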
Thanks in advance for your input.