First off, I know this question has been asked here before but this is a bit less ‘how do I do it’ and more ‘is this a good way to do it’.
I have some (and will be getting more) MiSeq metagenomic data from an environmental sample. We have a pipeline in place for binning and classification of metagenomic reads so I don’t plan to do that part in mothur but since those steps take a long time I’m looking for a quite way to give just look at the 16S community profile and see if it’s similar to what we’ve previously observed with pyrosequencing.
I’ve QCed the reads with the fastq.info and trim.seqs commands, then pooled the samples into fasta/name/groups files. From there, I tried to classify everything against the RDP9/PDS database and removed anything that couldn’t be classified, or was classified as eukaryotic/chloroplast/mitochondrial. The remaining reads were aligned and reads that couldn’t align were removed (this was a lot of them). The data was then screened/filtered and then re-classified as per the SOP.
This left me with 561 16S rRNA genes out of about 7 million reads. The thing I’m a bit worried about is that because this is shotgun sequencing, the screen.seqs command probably won’t be run as tightly as it would for amplicon sequencing where you have a defined start/stop position. I just picked my start/stop based on what looked like it would cover a reasonable amount of the aligned DNA, although I still lost about half of it. There were probably quite a few non-overlapping reads in what I kept anyway, which would make the filtering overly aggressive at removing alignment positions that weren’t shared between all reads. I’m trying to think of a better way to do this - I’d like to keep the alignment step because I think it’s a good test that I’m working with a real 16S read and not just some part of a genome that has some rough similarity, since I’m classifying it using the Bayesian method, not BLAST. If I was to align the reads, remove those that didn’t align and then de-align the reads and classify them without using a filter would this still result in a reasonable classification?
This won’t comprise the bulk of the study, but I thought it would be an interested side-aim to see if there are any gross distortions to the 16S rRNA gene abundances in the raw DNA vs what we see after PCR amplification.