classify.seqs taxonomy summary file formatting

Hi Pat,
Thanks for putting together the classify function. I’ve been using it to compare classifications across many pyrotag-sequenced samples, but the way the output file is structured makes it difficult to match up the separately classified samples and compare classified abundances.

I noticed that each sample fasta file (say, parsed from the trim function using the groups file) will generate its own tax.summary file after classification. Each line of a tax.summary file has roughly this format:

0 0 Root 1 39725
1 0.1 Bacteria 26 39725
2 0.1.1 Acidobacteria 3 44

A major issue that makes the classification identifier (here 0.1.1 for Acidobacteria) difficult to use for matching samples is that the numbering differs in each sample’s tax.summary file: it is based on the classifications present in that sample rather than on a master sort order. From what I can tell, though, the classifications are listed in the same order (when present); is that right? If the output used classification identifiers based on a master sorted list, rather than numbering only what the sample contains, matching classifications across samples would be much easier.
How hard would this be to implement?
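
In the meantime, a possible workaround is to match on names instead of the per-sample numbering. Here is a minimal sketch, assuming the five whitespace-separated columns shown above and that each parent line appears before its children; the function name `lineage_totals` and the `;`-joined key are my own choices, not anything mothur produces.

```python
def lineage_totals(lines):
    """Map each name-based lineage (e.g. 'Root;Bacteria') to its total count.

    Assumes five columns per line: taxlevel, rank ID, label, daughter
    levels, total, with parents listed before their children.
    """
    by_rank_id = {}  # rank ID within this sample -> tuple of labels from the root
    totals = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 5 or not parts[0].isdigit():
            continue  # skip blank lines and any header row
        rank_id = parts[1]
        label = " ".join(parts[2:-2])  # tolerate labels that contain spaces
        total = int(parts[-1])
        # The parent of rank ID "0.1.1" is "0.1"; the root ("0") has no parent.
        parent_id = rank_id.rpartition(".")[0]
        lineage = by_rank_id.get(parent_id, ()) + (label,)
        by_rank_id[rank_id] = lineage
        totals[";".join(lineage)] = total
    return totals


sample = """0 0 Root 1 39725
1 0.1 Bacteria 26 39725
2 0.1.1 Acidobacteria 3 44"""
totals = lineage_totals(sample.splitlines())
```

Because the key is built from the labels, the same taxon gets the same key in every sample no matter how its rank ID was numbered there.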

Jackson

Your idea is a good one and I appreciate your feedback. It shouldn’t be too hard to implement - it’s on the list.

If I may add to this request, what I’m really trying to do would be covered by support for a “group” file option in classify.seqs. After the sequences are classified, the tax.summary file could report the counts in one column per environment/sample, instead of the current single entry giving the total number of sequences classified at each taxonomic level. Right now I have a Python script that parses this crudely from several tax.summary files, but as soon as the file format changes I’ll have to rewrite it.

e.g.

taxlevel, rank ID, label, daughterlevels, total sample1 sample2 sample3 ...etc.
0 0 Root 1 55 10 15 30
1 0.1 Bacteria 5 25 5 10 10
2 0.1.1 Actinobacteria 1 6 0 2 4
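
The layout above can be approximated today by merging per-sample counts once they are keyed by name. This is only a sketch of the idea, not how classify.seqs would do it: the `merge_samples` helper and the `;`-joined lineage keys are hypothetical, and the counts are the illustrative numbers from the table above.

```python
def merge_samples(per_sample):
    """Combine {sample: {lineage: count}} into rows of
    [lineage, total, count_in_sample1, count_in_sample2, ...]."""
    samples = sorted(per_sample)
    # Union of every lineage seen in any sample, parents ahead of children.
    lineages = sorted(
        {lin for counts in per_sample.values() for lin in counts},
        key=lambda lin: lin.split(";"),
    )
    rows = []
    for lin in lineages:
        # A taxon missing from a sample is reported as 0 for that sample.
        counts = [per_sample[s].get(lin, 0) for s in samples]
        rows.append([lin, sum(counts)] + counts)
    return samples, rows


per_sample = {
    "sample1": {"Root": 10, "Root;Bacteria": 5},
    "sample2": {"Root": 15, "Root;Bacteria": 10,
                "Root;Bacteria;Actinobacteria": 2},
    "sample3": {"Root": 30, "Root;Bacteria": 10,
                "Root;Bacteria;Actinobacteria": 4},
}
samples, rows = merge_samples(per_sample)

print("\t".join(["lineage", "total"] + samples))
for row in rows:
    print("\t".join(str(field) for field in row))
```

Filling absent taxa with 0 keeps every row the same width, which is what a stacked abundance plot needs.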

I am attempting to create a figure as in Brazelton et al. 2010 (Fig. 4) http://www.pnas.org/content/107/4/1612.abstract.

Jackson

Ahhh… That’s a very cool idea. It won’t be in the next release, but it will definitely get in soon.

Thanks for the idea,
Pat