Trim reference taxonomy

Chris · February 21, 2012, 8:57am

Werner et al. in their new article ( http://www.ncbi.nlm.nih.gov/pubmed/21716311 ) suggest to trim reference sequences to the primer region before classifying our sequences. This action could result in higher confidence thresholds and improve classification depth. But it is not a simple task to construct such reference dataset. Would it be possible to implement an option to trim reference sequences according to the 16S region (V1-V2 etc)? Or is it possible to make available aligned datasets so we can cut out the necessary part?

pschloss · February 21, 2012, 1:10pm

Hi Chris,

Great question. Very doable within mothur. You align a version of the reference fasta file and then run filter.seqs(hard=whatever.filter, fasta=reference.align). The whatever.filter would be the filter file that was created when you ran filter.seqs on your actual dataset. Alternatively (and perhaps a better approach) would be to filter both your alignment and the reference alignment together - filter.seqs(fasta=reference.align-yoursequences.align, vertical=T, trump=.). Then you can run classify.seqs with your sequences and new reference file - mothur doesn’t care about the gaps in the alignment.

Hope this helps,
Pat

Kirk · February 23, 2012, 2:57pm

Hi,

these seem to be more sound approaches than just keeping the first 500 nt for my V1-V3 amplicons
Not that the results differ that dramatically, but still …

First approach worked fine (hard filter), the second alas, removed all sequences from the database.

It further increased the confidences and consequently further reduced the number of unclassified sequences.
Thanks for the idea!

pschloss · February 23, 2012, 8:51pm

Kirk - Yeah I wondered about that after I hit submit. Since there are sequences in the DB that aren’t full length, there are likely to be sequences in the reference db that don’t overlap in your region. You could remove those and try again.

Glad you had some success…
Pat

gwidmer · March 1, 2012, 2:56pm

After reading Werner et al., I had a similar question. I work with ~60-nt amplicons from the V6 region and, depending on the cutoff value and training set I use, I get between 30-50% unlassified reads. As I am reading this thread, I am wondering why you, Pat, are suggesting to use filter.seqs to generate trimmed reference file as such files would only be useful for classifying a particular dataset. Would it be possible to trim the aligned reference database according to the position of, say, the V6 or V1-V2 on the E.coli 16S rRNA gene and make such trimmed databases available to the community? If this makes any kind of sense, and you can give me simple instructions on how to do that, I would be happy to generate such reference files. thanks.

pschloss · March 1, 2012, 4:31pm

As I am reading this thread, I am wondering why you, Pat, are suggesting to use filter.seqs to generate trimmed reference file as such files would only be useful for classifying a particular dataset.

Right. As you point out the alternative would be to have region specific databases. We could figure that out. The problem is that while the regions are fairly well defined, the primer sites to amplify those regions is not so well defined or consistent. Also, depending on how people trim their sequences someone may say they are looking at the V13 region, but are really only looking at the V12 region after they’ve trimmed. I’m pretty adamant that people only consider bases that overlap in every one of their sequences, necessitating an initial alignment. It is then pretty easy to just use filter.seqs to make a dataset-specific reference set.

gwidmer · March 1, 2012, 6:50pm

Hi Pat, I aligned trainset6_032010.rdp.fasta and trimmed it to ~ 100 nt by removing sequence up- and downstream of V6. I did this with BioEdit. I then removed all “-” and saved fasta formated sequences in a txt file. I re-ran classify.seqs using this trimmed trainset6 file as template. Consistent with Werner et al., the number of unclassified reads went down; in this case from 41% to 28%. I would be happy to make the trimmed template file available to the mothur community, if you think this could be useful. I would prefer if you checked the file first though. I don’t want to upload a piece of garbage…

Giovanni

pschloss · March 2, 2012, 1:40pm

Hi Giovanni et al,

I’m in the process of updating a number of the reference files (RDP, silva, gg) and can post a number of the different regions.

Thanks!
Pat

Chris · March 6, 2012, 7:52am

I aligned greengenes dataset and now when I’m using filter.seqs (fasta=reference.align-yoursequences.align) then I get an error “Sequences are not all the same length, please correct.” Does it mean reference sequences?

pschloss · March 6, 2012, 1:12pm

What did you align the gg reference set to and what did you align your sequences to?

Chris · March 7, 2012, 10:55am

I used the same silva reference alignment from mothur site for both datasets.

westcott · March 7, 2012, 1:49pm

Could send you logfile to mothur.bugs@gmail.com?

Topic		Replies	Views
v3 and v4 region on Silvia and GreenGenes database. Commands in mothur	1	2815	September 23, 2014
Classify.seqs with silva.nr_v123 files provided by mothur Commands in mothur	3	1849	March 18, 2016
Filter out a specified sequence Feature requests	2	4532	August 31, 2010
align.seqs Commands in mothur	6	5404	August 28, 2010
classification leads to many unclassified Commands in mothur	5	6843	July 28, 2010

Trim reference taxonomy

Related topics