I’ve recreated the collection of 190,661 SILVA v128 reference sequences following the README here: http://blog.mothur.org/2017/03/22/SILVA-v128-reference-files/
At the very end of the README, it suggests classifying sequences using silva.nr_v128.align and silva.nr_v128.tax after running pcr.seqs on the *.align file to extract the V4 region.
In the MiSeq SOP tutorial, however, classify.seqs is run with the following (smaller) reference and taxonomy files: reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax
My question: it seems like the ‘trainset’ reference file in the MiSeq SOP contains full-length, degapped sequences, rather than sequences that have been trimmed with pcr.seqs and aligned per the README protocol. I understand that running pcr.seqs on the 190,661 sequences will greatly reduce the computational time, but is it also necessary to degap these 190,661 V4 reference sequences prior to running classify.seqs? Or maybe it doesn’t matter?
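To be concrete about what I mean by "degap": stripping the alignment gap characters (`-` and `.`) from each aligned sequence so that only the bases remain. Within mothur I would expect to do this with degap.seqs; the Python below is only a minimal sketch of the operation itself, and the function names and the example record are hypothetical, not taken from mothur or the README.

```python
def degap(aligned_seq: str) -> str:
    """Remove alignment gap characters ('-' and '.') from one sequence."""
    return aligned_seq.replace("-", "").replace(".", "")


def degap_fasta(lines):
    """Yield FASTA lines with gaps stripped from sequence lines only."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header lines pass through unchanged
        else:
            yield degap(line)


# Hypothetical aligned record, just to illustrate the transformation.
example = [">seq1 (hypothetical record)", "..AC-GT--A.."]
print(list(degap_fasta(example)))
```

So the question is whether the V4-trimmed silva.nr_v128.align needs this kind of treatment before it is passed to classify.seqs, or whether the aligned file can be used as-is.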