Align.seqs removing most bp

Hi,

I’m following your tutorial with my own data, and it was working quite well until I tried align.seqs. Before the alignment, all of my sequences are around 250-275 bp; afterwards, the median number of bases drops all the way down to 4. I’m not sure why this is happening, because an alignment shouldn’t be removing bases. I’m going to paste the relevant info from the logfile:

mothur > summary.seqs(fasta=hmw292c01.shhh.trim.unique.fasta, name=hmw292c01.shhh.trim.unique.names)

Using 1 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 230 230 0 3 1
2.5%-tile: 1 249 249 0 4 1865
25%-tile: 1 261 261 0 4 18650
Median: 1 267 267 0 5 37299
75%-tile: 1 273 273 0 5 55948
97.5%-tile: 1 286 286 0 7 72733
Maximum: 1 344 344 0 9 74597
Mean: 1 267.099 267.099 0 4.73669

# of unique seqs: 40246

total # of seqs: 74597

mothur > align.seqs(fasta=hmw292c01.shhh.trim.unique.fasta, reference=silva.bacteria.fasta)

Some of you sequences generated alignments that eliminated too many bases, a list is provided in C:\Users\Mark\Downloads\Mothur.win_64\hmw292c01.shhh.trim.unique.flip.accnos. If you set the flip parameter to true mothur will try aligning the reverse compliment as well.
It took 675 secs to align 40246 sequences.

mothur > summary.seqs(fasta=hmw292c01.shhh.trim.unique.align, name=hmw292c01.shhh.trim.unique.names)

Using 1 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 1865
25%-tile: 1044 1051 2 0 1 18650
Median: 43056 43116 4 0 2 37299
75%-tile: 43112 43116 11 0 2 55948
97.5%-tile: 43116 43116 27 0 4 72733
Maximum: 43116 43116 82 0 8 74597
Mean: 24777.4 24798.1 8.0463 0 1.86435

# of unique seqs: 40246

total # of seqs: 74597

The problem is likely that your sequences are backwards. Do you know whether you sequenced from the 5’ to 3’ end of the gene or from the 3’ to 5’ end? In the SOP we sequenced from 3’ to 5’ and used the flip=T parameter in trim.seqs. If you are going from 5’ to 3’, then you do not want to use flip=T there. You could also re-run align.seqs with flip=T, and it will test both directions and pick the best. Alternatively, you could run reverse.seqs and then align.seqs(flip=F).

Pat
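
For readers unsure what "backwards" means here: a read taken from the opposite strand has to be reverse-complemented before it lines up with the reference. The following is only a minimal Python sketch of that operation, not mothur's implementation; the function name reverse_complement is made up for illustration.

```python
# Map each base (and the common IUPAC ambiguity codes) to its complement.
# S and W are their own complements, so they are left unchanged, as are gaps.
_COMPLEMENT = str.maketrans(
    "ACGTURYKMBDHVNacgturykmbdhvn",
    "TGCAAYRMKVHDBNtgcaayrmkvhdbn",
)

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    return seq.translate(_COMPLEMENT)[::-1]

if __name__ == "__main__":
    print(reverse_complement("AACG"))
```

Applying the function twice returns the original read, which is a quick sanity check that nothing is lost by flipping.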

Hey again,

Just wanted to check back in and tell you the problem was what you expected it to be. After running reverse.seqs, then aligning, filtering, screening, and removing chimeras and contaminants, I still have 26,000 quality, unique sequences.

Thank you for your help.

I have one more question: how do you put taxonomic information on a phylogenetic tree?

I ran dist.seqs(), then clearcut(), then classify.tree(), and there is really no difference between the .tre files from clearcut and classify.tree().

I also have opened up the .phylip.taxonomy.summary file and there have been things classified (just for additional information, I ran my taxonomy through the SOP and it looks fine).

The classify.tree command looks at each node in the tree and finds the consensus taxonomy for its descendants. The tree classify.tree creates is the same tree as clearcut, it just has the nodes labeled so you can look up the consensus taxonomy in the summary file.

Thanks for clearing that up, but that is also my problem - when I go to view the tree that classify.tree creates, the nodes are not labeled at all (I’m using Archaeopteryx to view - it’s too big for TreeViewer)

I downloaded Archaeopteryx and opened a tree classify.tree created and I can see the labels.

Let me post relevant info from the logfile:

mothur > dist.seqs(fasta=hmw292c01Final.fasta, output=lt, processors=8)

Output File Names:
C:\Users\Mark\Downloads\Mothur.win_64\hmw292c01Final.phylip.dist

mothur > clearcut(phylip=hmw292c01Final.phylip.dist)

Output File Names:
C:\Users\Mark\Downloads\Mothur.win_64\hmw292c01Final.phylip.tre

mothur > classify.tree(taxonomy=hmw292c01Final.taxonomy, tree=hmw292c01Final.phylip.tre, name=hmw292c01Final.names, group=hmw292c01Final.groups)

Output File Names:
C:\Users\Mark\Downloads\Mothur.win_64\hmw292c01Final.phylip.taxonomy.summary
C:\Users\Mark\Downloads\Mothur.win_64\hmw292c01Final.phylip.taxonomy.tre

If my commands look fine I’m not sure what else I can show you outside of sending you my files.

If it’s a large tree, are you zoomed in enough? Can you open the *.tre file in a text editor and post the beginning of the tree string? It should look something like:

(((AY457780:-0.000186,((AY457809:0.005377,AY457748:0.005493)243:0.000789,…

243 is the node label for the pair (AY457809:0.005377,AY457748:0.005493). I can see this label on the tree in Archaeopteryx as well.
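
If the tree is too large to inspect by eye, the internal node labels can be pulled out of the tree string with a short script. This is a rough Python sketch that assumes a plain Newick string with unquoted labels; the function name internal_node_labels is hypothetical.

```python
import re

def internal_node_labels(newick: str) -> list[str]:
    """Return the labels that directly follow a closing parenthesis.

    In Newick, text right after ')' and before ':', ',', ')' or ';'
    is the label of the internal node that the parenthesis closes.
    """
    return re.findall(r"\)([^:;,()\s]+)", newick)

if __name__ == "__main__":
    snippet = "(((AY457780:-0.000186,((AY457809:0.005377,AY457748:0.005493)243:0.000789,"
    print(internal_node_labels(snippet))
```

An empty result would mean the tree has no node labels at all; a non-empty one means the labels are in the file and any display problem lies with the viewer.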

The start of the tree string is as follows:
(((((((HMW292C01BCKCU:0.1796,((((HMW292C01CFNA7:0.005409,HMW292C01CFAKO:0.005741)26047:0.005233,HMW292C01BHFWO:0.009667)26048:0.137505,(HMW292C01AWRVR:0.102916,

Edit: Yes, I’m zoomed in and have the appropriate checkboxes selected for displaying taxonomic information.

You have labels in the tree. I am not sure why you are not able to see them with the tree visualization software you are using. Would the classify.otu command (http://www.mothur.org/wiki/Classify.otu) allow you to get the information you are looking for?

Apparently not. It’s odd that the tax file I get from classify.otu has syntax errors for classify.tree.

Here’s what the logfile says:
mothur > classify.tree(taxonomy=hmw292c01Final.an.unique.cons.taxonomy, name=hmw292c01Final.names, group=hmw292c01Final.pick.groups, tree=hmw292c01Final.phylip.tre)
[ERROR]: OTU is missing the final ‘;’, ignoring.
[ERROR]: Taxonomy is missing the final ‘;’, ignoring.
[ERROR]: Otu00002 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Bacteroidetes”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00004 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);Betaproteobacteria(100);Rhodocyclales(100);Rhodocyclaceae(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00006 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00008 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);Alphaproteobacteria(100);Rhizobiales(100);Bradyrhizobiaceae(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00010 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00012 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);Betaproteobacteria(100);Rhodocyclales(100);Rhodocyclaceae(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00014 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00016 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00018 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);Gammaproteobacteria(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00020 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00022 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00024 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00026 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00028 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Bacteroidetes”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: 86 is already in your taxonomy file, names must be unique./n[ERROR]: Otu00030 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00032 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00034 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00036 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00038 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);Betaproteobacteria(100);Rhodocyclales(100);Rhodocyclaceae(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00040 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00042 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00044 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Bacteroidetes”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00046 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00048 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00050 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Bacteroidetes”(100);unclassified(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00052 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp7(100);Acidobacteria_Gp7_order_incertae_sedis(100);Acidobacteria_Gp7_family_incertae_sedis(100);Gp7(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00054 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Acidobacteria”(100);Acidobacteria_Gp1(100);Acidobacteria_Gp1_order_incertae_sedis(100);Acidobacteria_Gp1_family_incertae_sedis(100);Gp1(100); is missing the final ‘;’, ignoring.
[ERROR]: Otu00056 is missing the final ‘;’, ignoring.
[ERROR]: Bacteria(100);“Proteobacteria”(100);Betaproteobacteria(100);unclassified(100);unclassified(100);unclassified(100); is missing the final ‘;’, ignoring.
[ERROR]: 54 is already in your taxonomy file, names must be unique./n

When I open the cons.taxonomy file everything looks okay to me.

The taxonomy file from classify.seqs is in this format: SequenceName taxonomy
GQY1XT001C296C Bacteria(100);“Bacteroidetes”(100);“Bacteroidia”(95);“Bacteroidales”(95);“Porphyromonadaceae”(91);unclassified;

The taxonomy file from classify.otu is in this format: OTULabel NumSeqsInOTU taxonomy
Otu01 29578 Bacteria(100);“Bacteroidetes”(100);“Bacteroidia”(100);“Bacteroidales”(100);“Porphyromonadaceae”(100);unclassified(100);

You don’t want to use the *.cons.taxonomy file with classify.tree. What I meant was, would the function of the classify.otu command work for your analysis instead of the classify.tree command?
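
To check which of the two layouts a given taxonomy file uses before passing it to a command, something like the Python sketch below may help. It is only a heuristic based on the two example lines above: it assumes whitespace-separated columns and taxonomy strings without embedded spaces, and the function name taxonomy_file_kind is invented for illustration.

```python
def taxonomy_file_kind(line: str) -> str:
    """Guess the layout of one taxonomy-file line (hypothetical helper).

    Two columns (name, taxonomy)        -> "classify.seqs"
    Three columns (label, size, taxon.) -> "classify.otu"
    Anything else, e.g. a header line   -> "unknown"
    """
    fields = line.split()
    if len(fields) == 2 and fields[1].endswith(";"):
        return "classify.seqs"
    if len(fields) == 3 and fields[1].isdigit() and fields[2].endswith(";"):
        return "classify.otu"
    return "unknown"

if __name__ == "__main__":
    print(taxonomy_file_kind("GQY1XT001C296C\tBacteria(100);unclassified;"))
    print(taxonomy_file_kind("Otu01\t29578\tBacteria(100);unclassified(100);"))
```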

No, I’m going to have to scrap that idea and try something else. In my groups file, I have 8 groups, and I would like to combine them.

First, merge.groups needs a design file; where would I get that?

Second, my groups already have dashes in them. Will that affect mothur’s ability to distinguish between different groups (because you separate multiple groups with dashes)?

You will need to create the design file yourself (see http://www.mothur.org/wiki/Design_File); mothur does not have a command to do that. You can use the \ character to escape dashes in file names and group names.
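
Since the design file has to be written by hand, a few lines of Python can generate it. This sketch assumes the design file is two tab-separated columns (group name, then the set it should be merged into); check the wiki page above for the exact layout. The treatment label "combined" and both function names are hypothetical.

```python
def design_lines(assignments: dict[str, str]) -> list[str]:
    """Build one 'group<TAB>treatment' line per group (assumed layout)."""
    return [f"{group}\t{treatment}" for group, treatment in assignments.items()]

def escape_dashes(name: str) -> str:
    """Prefix each dash with a backslash, as suggested for group names."""
    return name.replace("-", "\\-")

if __name__ == "__main__":
    # Group names taken from this thread; the label "combined" is made up.
    groups = {"21A16": "combined", "22A16": "combined", "25A16": "combined"}
    print("\n".join(design_lines(groups)))
    print(escape_dashes("site-1"))
```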

Oh, I think I found the problem with the .tre file taxonomies not showing up. In the *.taxonomy.summary, the taxonomy column is completely unclassified or unknown. Do you have any idea why mothur isn’t classifying the sequences? The taxonomy file definitely has things classified.

I’m going to repost for clarity, because I reread my above post and couldn’t understand what I was saying:

I ran classify.tree with taxonomy, tree, name, and group parameters.

In the hmw292c01Final.phylip.taxonomy.summary, all sequences are unclassified or unknown. I expect to have some unknown sequences; however, every other sequence is unclassified.

Any idea why this is happening?

The consensus taxonomy requires more than 50% of the sequences for that node to be classified to the same taxonomy to report something other than unclassified. Here’s an example:

TreeNode NumRep Taxonomy
243 2 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Faecalibacterium(100);
244 2 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Faecalibacterium(100);

481 4 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Faecalibacterium(100);
482 235 Bacteria(100);unclassified;unclassified;unclassified;unclassified;unclassified;
483 242 Bacteria(100);unclassified;unclassified;unclassified;unclassified;unclassified;

For node 483 there are 242 sequences. The largest group, 103 sequences, classifies to Bacteria(100);Firmicutes(100); but 103 is not greater than 121 (half of 242), so node 483’s result is Bacteria(100);unclassified;.
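
The majority rule can be worked through in a few lines of Python. This is a simplified sketch of the "more than 50%" rule as described above, not mothur's actual code. The split of the remaining 139 sequences into two other phyla is invented purely to make the numbers add up to 242.

```python
from collections import Counter

def consensus_taxonomy(taxonomies):
    """Consensus over a node's descendants using a strict-majority rule.

    taxonomies: one list of taxon names per sequence, e.g.
    ["Bacteria", "Firmicutes"]. A level is reported only if more than
    half of ALL sequences at the node share it; otherwise that level and
    every deeper level become "unclassified".
    """
    total = len(taxonomies)
    depth = max(len(t) for t in taxonomies)
    consensus, prefix = [], []
    for level in range(depth):
        # Count only sequences that agree with the consensus so far.
        counts = Counter(
            t[level] for t in taxonomies
            if len(t) > level and list(t[:level]) == prefix
        )
        if not counts:
            break
        taxon, n = counts.most_common(1)[0]
        if 2 * n <= total:  # not a strict majority
            break
        consensus.append(f"{taxon}({round(100 * n / total)})")
        prefix.append(taxon)
    consensus += ["unclassified"] * (depth - len(consensus))
    return ";".join(consensus) + ";"

if __name__ == "__main__":
    node_483 = (
        [["Bacteria", "Firmicutes"]] * 103
        + [["Bacteria", "Proteobacteria"]] * 70   # assumed split
        + [["Bacteria", "Bacteroidetes"]] * 69    # assumed split
    )
    print(consensus_taxonomy(node_483))
```

With 103 of 242 sequences the largest phylum falls short of a majority, so only the Bacteria level is reported.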

Classify.tree isn’t even classifying anything as bacteria. This is how my taxonomy summary looks…

TreeNode Group NumRep Taxonomy
26047 22A16 2 unclassified;
26048 22A16 2 unclassified;
26048 25A16 1 unknown(100);
26049 22A16 3 unclassified;
26050 21A16 1 unknown(100);
26050 22A16 3 unclassified;
26051 21A16 1 unknown(100);
26051 22A16 5 unclassified;
26051 25A16 1 unknown(100)

I figure there must be something I’m doing wrong but I don’t know what… I left the cutoff at default 51%

What are the classifications of the 2 sequences from group 22A16 at node 26047? You should be able to open your tree file in a text editor and search for 26047 to find the names in the tree. Put those names in an accnos file and run:

get.seqs(name=yourNamesFile, group=yourGroupFile, taxonomy=yourTaxonomyFile, accnos=yourAccnosFile)
get.groups(name=current, group=current, taxonomy=current, groups=22A16)
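
Finding the sequence names under a node by hand is tedious in a large tree. The Python sketch below collects them from the tree string; it assumes a plain Newick string with unquoted names, and the function name leaves_under is hypothetical. Writing the returned names one per line gives a file usable as the accnos file.

```python
import re

def leaves_under(newick: str, label: str) -> list[str]:
    """Return the leaf names in the subtree closed by ')<label>'."""
    # Require a delimiter after the label so "2604" cannot match "26047".
    match = re.search(r"\)" + re.escape(label) + r"(?=[:,;)])", newick)
    if match is None:
        raise ValueError(f"node label {label!r} not found")
    close = match.start()
    depth = 0
    # Walk backwards to the '(' that matches this ')'.
    for i in range(close, -1, -1):
        if newick[i] == ")":
            depth += 1
        elif newick[i] == "(":
            depth -= 1
            if depth == 0:
                subtree = newick[i:close + 1]
                break
    else:
        raise ValueError("unbalanced parentheses before the node label")
    # Leaf names start right after '(' or ','; internal labels follow ')'.
    return re.findall(r"(?<=[(,])[^(),:;]+", subtree)

if __name__ == "__main__":
    tree = ("(((((((HMW292C01BCKCU:0.1796,((((HMW292C01CFNA7:0.005409,"
            "HMW292C01CFAKO:0.005741)26047:0.005233,HMW292C01BHFWO:0.009667)"
            "26048:0.137505,(HMW292C01AWRVR:0.102916,")
    print("\n".join(leaves_under(tree, "26047")))
```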