Trim reference taxonomy

Werner et al. in their new article ( ) suggest to trim reference sequences to the primer region before classifying our sequences. This action could result in higher confidence thresholds and improve classification depth. But it is not a simple task to construct such reference dataset. Would it be possible to implement an option to trim reference sequences according to the 16S region (V1-V2 etc)? Or is it possible to make available aligned datasets so we can cut out the necessary part?

Hi Chris,

Great question. Very doable within mothur. You align a version of the reference fasta file and then run filter.seqs(hard=whatever.filter, fasta=reference.align). The whatever.filter would be the filter file that was created when you ran filter.seqs on your actual dataset. Alternatively (and perhaps a better approach) would be to filter both your alignment and the reference alignment together - filter.seqs(fasta=reference.align-yoursequences.align, vertical=T, trump=.). Then you can run classify.seqs with your sequences and new reference file - mothur doesn’t care about the gaps in the alignment.

Hope this helps,


these seem to be more sound approaches than just keeping the first 500 nt for my V1-V3 amplicons :slight_smile:
Not that the results differ that dramatically, but still …

First approach worked fine (hard filter), the second alas, removed all sequences from the database.

It further increased the confidences and consequently further reduced the number of unclassified sequences.
Thanks for the idea!

Kirk - Yeah I wondered about that after I hit submit. Since there are sequences in the DB that aren’t full length, there are likely to be sequences in the reference db that don’t overlap in your region. You could remove those and try again.

Glad you had some success…

After reading Werner et al., I had a similar question. I work with ~60-nt amplicons from the V6 region and, depending on the cutoff value and training set I use, I get between 30-50% unlassified reads. As I am reading this thread, I am wondering why you, Pat, are suggesting to use filter.seqs to generate trimmed reference file as such files would only be useful for classifying a particular dataset. Would it be possible to trim the aligned reference database according to the position of, say, the V6 or V1-V2 on the E.coli 16S rRNA gene and make such trimmed databases available to the community? If this makes any kind of sense, and you can give me simple instructions on how to do that, I would be happy to generate such reference files. thanks.

As I am reading this thread, I am wondering why you, Pat, are suggesting to use filter.seqs to generate trimmed reference file as such files would only be useful for classifying a particular dataset.

Right. As you point out the alternative would be to have region specific databases. We could figure that out. The problem is that while the regions are fairly well defined, the primer sites to amplify those regions is not so well defined or consistent. Also, depending on how people trim their sequences someone may say they are looking at the V13 region, but are really only looking at the V12 region after they’ve trimmed. I’m pretty adamant that people only consider bases that overlap in every one of their sequences, necessitating an initial alignment. It is then pretty easy to just use filter.seqs to make a dataset-specific reference set.

Hi Pat, I aligned trainset6_032010.rdp.fasta and trimmed it to ~ 100 nt by removing sequence up- and downstream of V6. I did this with BioEdit. I then removed all “-” and saved fasta formated sequences in a txt file. I re-ran classify.seqs using this trimmed trainset6 file as template. Consistent with Werner et al., the number of unclassified reads went down; in this case from 41% to 28%. I would be happy to make the trimmed template file available to the mothur community, if you think this could be useful. I would prefer if you checked the file first though. I don’t want to upload a piece of garbage…


Hi Giovanni et al,

I’m in the process of updating a number of the reference files (RDP, silva, gg) and can post a number of the different regions.


I aligned greengenes dataset and now when I’m using filter.seqs (fasta=reference.align-yoursequences.align) then I get an error “Sequences are not all the same length, please correct.” Does it mean reference sequences?

What did you align the gg reference set to and what did you align your sequences to?

I used the same silva reference alignment from mothur site for both datasets.

Could send you logfile to