classify.seqs: degap the V4 alignment file?


I’ve recreated the collection of the 190,661 Silva v128 reference sequences following the README here:

At the very end of the README, it is suggested that sequences be classified using silva.nr_v128.align after running pcr.seqs on the *.align file to extract the V4 region.

Per the MiSeq SOP tutorial, classify.seqs is run with the following (smaller) reference and taxonomy files: reference=trainset9_032012.pds.fasta,

My question: it seems like the ‘trainset’ reference file in the MiSeq SOP contains full-length, degapped sequences, rather than sequences that have been aligned and trimmed with pcr.seqs per the README protocol. I understand that running pcr.seqs on the 190,661 sequences will greatly reduce the computational time, but is it also necessary to degap these 190,661 V4 reference sequences prior to running classify.seqs? Maybe it doesn’t matter?


You would need to align the RDP training set to the SILVA alignment to extract the V4 region, as we do for the alignment in the MiSeq SOP. It doesn’t really matter whether you degap the sequences prior to classify.seqs, since mothur will do that for you if you haven’t.
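
To make concrete what "degapping" means here, below is a minimal Python sketch. It is only an illustration, not mothur's implementation, and it assumes the usual alignment convention that '-' and '.' are the gap/padding characters. The function names and the example record are hypothetical. The point is that an aligned, trimmed reference and its degapped version carry the same bases, which is why doing the degap step yourself beforehand should not change what the classifier sees.

```python
def degap(aligned: str) -> str:
    """Remove alignment gap characters ('-' and '.') from one sequence."""
    return aligned.replace("-", "").replace(".", "")


def degap_fasta(lines):
    """Yield FASTA lines with gaps stripped from sequence lines only."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header lines are left untouched
        else:
            yield degap(line)


# Hypothetical aligned record, made up for illustration only.
aligned_records = [
    ">ref1 (hypothetical example record)",
    "....AC-GT--ACGT....",
]
print("\n".join(degap_fasta(aligned_records)))
```

In practice you would not need a script like this: mothur has its own degap.seqs command, and per the answer above classify.seqs handles gapped references itself.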