OTU analysis and stats!

Dear Pat,

I have a few questions about the data used in OTU analysis and hypothesis-testing approaches. I want to make sure I know what type of data I’m feeding into the analyses, so please help me clarify some of my doubts:

collector’s curves: uses a table of all OTUs found (rare and abundant) on each sample?

diversity calculators: Chao uses only the OTUs present in each sample that represent 1 and 2 sequences?

unifrac.weighted and for other stats: uses the table of OTUs that have a minimum abundance >10 per sample, built with sequences that have a similarity of 97% per OTU and have a classification similarity to genus level of 80% to a reference seq in the greengenes/RDP II classifier database?

PCoA: uses the table of OTUs that have a minimum abundance >10 per sample, and considers only presence/absence of the taxa, not phylogenetic distances…so I can use the rabund table for this?

Cheers,

Astrid




collector’s curves: uses a table of all OTUs found (rare and abundant) on each sample?

Correct.
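To make the idea concrete, here is a minimal sketch of what a collector's curve tracks: the number of distinct OTUs observed as more and more sequences from one sample are examined, with rare and abundant OTUs all counted. The function name and the OTU labels are hypothetical and only for illustration; this is not mothur's implementation.

```python
def collectors_curve(otu_labels):
    """Return the number of distinct OTUs seen after each sequence is sampled."""
    seen = set()
    curve = []
    for label in otu_labels:
        seen.add(label)          # every OTU counts, rare or abundant
        curve.append(len(seen))  # observed richness so far
    return curve


# Hypothetical sample: each entry is the OTU that one sequence belongs to.
reads = ["Otu1", "Otu1", "Otu2", "Otu1", "Otu3", "Otu2"]
curve = collectors_curve(reads)
print(curve)
```

A curve that flattens out suggests that further sampling is turning up few new OTUs.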

diversity calculators: Chao uses only the OTUs present in each sample that represent 1 and 2 sequences?

It also uses the total number of OTUs observed.
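As a rough illustration of the three quantities involved, the sketch below computes the bias-corrected form of the Chao1 estimator, S_obs + n1(n1-1) / (2(n2+1)), where S_obs is the number of OTUs observed, n1 the number of singletons, and n2 the number of doubletons. The function name and the counts are hypothetical, and the choice of the bias-corrected form is an assumption.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate for one sample.

    otu_counts: number of sequences in each OTU for that sample.
    """
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)                     # total OTUs observed
    n1 = sum(1 for c in counts if c == 1)   # OTUs with exactly one sequence
    n2 = sum(1 for c in counts if c == 2)   # OTUs with exactly two sequences
    return s_obs + n1 * (n1 - 1) / (2 * (n2 + 1))


# Hypothetical sample: 7 observed OTUs, 3 singletons, 2 doubletons.
sample = [10, 5, 2, 2, 1, 1, 1, 0]
estimate = chao1(sample)
print(estimate)
```

The abundant OTUs contribute only through S_obs; the singletons and doubletons determine how much unseen richness is added on top.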

unifrac.weighted and for other stats: uses the table of OTUs that have a minimum abundance >10 per sample, built with sequences that have a similarity of 97% per OTU and have a classification similarity to genus level of 80% to a reference seq in the greengenes/RDP II classifier database?

No. It uses a phylogenetic tree and either a name and group file or a count table.

PCoA: uses the table of OTUs that have a minimum abundance >10 per sample, and considers only presence/absence of the taxa, not phylogenetic distances…so I can use the rabund table for this?

No. It uses a distance matrix. Any distance matrix.
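Because PCoA only sees distances, whether the analysis reflects presence/absence or abundance is decided by how the distance matrix is calculated, not by PCoA itself. The sketch below builds one such matrix from a small, hypothetical OTU table using a Jaccard-type presence/absence distance; the function names and the table are assumptions for illustration, not mothur code.

```python
def jaccard_distance(a, b):
    """1 - (shared OTUs / OTUs present in either sample), ignoring abundance."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(present_a & present_b) / len(union)


def distance_matrix(table):
    """Return sample names and a square matrix of pairwise distances."""
    names = sorted(table)
    matrix = [[jaccard_distance(table[r], table[c]) for c in names] for r in names]
    return names, matrix


# Hypothetical OTU table: one list of per-OTU counts for each sample.
otu_table = {
    "sampleA": [10, 0, 3, 1],
    "sampleB": [2, 5, 0, 1],
    "sampleC": [0, 0, 7, 0],
}
names, matrix = distance_matrix(otu_table)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

Any other square distance matrix, phylogenetic or abundance-based, could be handed to an ordination in the same way.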

Hello Pat,

Thanks for getting back to me so quickly…

Ok, so in the stats analysis the phylogenetic trees include all the sequences that remain after the sequence processing steps have been applied, based only on alignment similarity and not on taxonomic identity/similarity… I ask because when I look at the OTU abundance table by group (create.database), I find seqs with taxonomic similarity under 80% at the phylum level. Is this normal? I read in the 454 SOP that the classification has a cutoff of 80% to label 1=genus level, right? But I still have seqs identified at lower than 80% at the phylum level. Any ideas?

OTUNumber OTUConTaxonomy Phylum Class Order Family Genus
Otu0057 Bacteria 100 ;“Planctomycetes” 63 ;“Planctomycetacia” 63 ;Planctomycetales 63 ;Planctomycetaceae 63 ;unclassified 57
Otu0094 Bacteria 100 ;unclassified 55 ;unclassified 55 ;unclassified 55 ;unclassified 55 ;unclassified 55
Otu0192 Bacteria 100 ;“Planctomycetes” 77 ;“Planctomycetacia” 77 ;Planctomycetales 77 ;Planctomycetaceae 77 ;unclassified 77



I also want to double-check whether, for everything (community membership and structure comparisons), only OTUs present at least 10 times in a sample are considered by default...

Cheers,

Astrid

What command syntax are you running for classify.seqs and the database command?

Hello Pat,

I would like to discuss the output of the classified seqs with you again. When I run create.database, I see that the taxonomic identification at class (and sometimes phylum) level is often lower than 80%. That doesn’t make sense, because there is a cutoff of 80% similarity to genus level, right? I wonder how much noise poorly identified seqs create in the analysis. Any ideas?

OTUNumber com519Fbar1 com519Fbar2 com519Fbar3 com519Fbar5 com519Fbar6 com519bar10 com519bar11 com519bar12 com519bar15 com519bar16 com519bar19 com519bar20 com519bar4 com519bar9 repSeqName repSeq OTUConTaxonomy Phylum Class Order Family Genus
Otu00325 0 10 0 5 2 0 0 0 8 0 0 0 0 0 IYW8NF002F75BE GT-AG-GGG-GCG-A-G-CG-TTGT-CC-GG-AT-TT-A–T-T-G-GGC-GTA—AA-GAGC-TC-G-TA-G-G-C-G-G–C-TC-A-A-C-AA—G-T-C-G—G-CCG-TG-A-AA-GC–CC-GAG-G–CT-C-AA—CC-T-C-GG-GA-C—G-C-C-G-G-T–C--GA-A-A-C-T-G-TTGT-G-G-C–T-A-G-G-G-T-C–C-GG–TA-G-A—G-GA-G-AG-T—GG–AATT-CCC-G-GT–GT-A-GCG-GTGAAA-TG-CGC-AGAT-A-TC-G-GGA–GG-A-AC-A-CC-AG–T--A–GC-GAA-G–G-C–G--G–C-T-CTCTG----G-GC-CG-----GC-A-C-C–GA-CG----CT-GA-GG–A-G-CGA–AA-G-C–TA–GGG-GAG-C-A-AACA–GG-ATTA-G-ATA-C-CC-T-G-GTA-G-T Bacteria(100) “Actinobacteria”(76) Actinobacteria(76) unclassified(76) unclassified(76) unclassified(76)
Otu00339 0 0 0 0 4 0 0 0 12 4 0 0 0 4 IYW8NF002IZ54A GA-AC-CGT-ACG-A-A-CG-TTAT-T-CGG-AA-TC-A–C-T–GGGC-TTA—AA-GAGT-GC-G-TA-G-G-C-G-G–C-TT-G-G-C-AA—G-T-T-G—G-GTG-TG-A-AA-TC–CC-TCG-G–CT-C-AA—CC-G-A-GG-AA-T—T-G-C-G-C-T–C--AA-A-A-C-T-G-CTA–A-G-C–T-T-G-A-G-G-G–A-GA–TA-G-G—G-GT-G-AG-C—GG–AACT-AAT-G-GT–GG-A-GCG-GTGAAA-TG-CGTTG-AT-A-TC-A-TTA–GG-A-AC-A-CC-GG–A--G–GC-GAA-A–G-C–G--G–C-T-CACTG----G-GT-CT-----TT-T-C-T–GA-CG----CT-GA-GG–C-A-CGA–AA-G-C—T-AGGG-GAG-C-G-AACG–GG-ATTA-G-ATA-C-CC-C-G-GTA-G-T Bacteria(100) unclassified unclassified unclassified unclassified unclassified
Otu00378 3 10 0 1 2 0 0 3 1 0 0 0 1 0 IYW8NF002ICOTU GT-AG-GGG-GCA-A-G-CG-TTGT-CC-GG-AT-TC-A–T-T-G-GGC-GTA—AA-GAGC-TC-G-TA-G-G-C-G-G–C-TC-A-G-T-AA—G-T-C-G—G-CCG-TG-A-AA-GC–CC-GAG-G–CT-C-AA—CC-T-C-GG-GA-C—G-C-C-G-G-T–C--GA-T-A-C-T-G-CTGT-G-G-C–T-A-G-G-G-T-C–C-GG–TA-G-A—G-GA-G-AG-T—GG–AATT-CCC-G-GT–GT-A-GCG-GTGAAA-TG-CGC-AGAT-A-TC-G-GGA–GG-A-AC-A-CC-AG–T--A–GC-GAA-G–G-C–G--G–C-T-CTCTG----G-GC-CG-----GT-A-C-C–GA-CG----CT-GA-GG–A-G-CGA–AA-G-C–TA–GGG-GAG-C-A-AACA–GG-ATTA-G-ATA-C-CC-T-G-GTA-G-T Bacteria(100) “Actinobacteria”(67) Actinobacteria(67) unclassified(67) unclassified(67) unclassified(67)
Otu00412 0 0 0 1 0 0 0 0 0 0 1 4 0 13 IYW8NF002G4JOU AG-AG-GGC-TCA-A-G-CG-TTAA-T-CGG-AA-TC-A–C-T–GGGC-TTA—AA-GGGT-CC-G-CA-G-G-C-G-G–G-TT-G-G-C-AA—G-T-A-T—C-GAG-TG-A-AA-TA–CC-ACG-G–CT-C-AA—CC-G-T-GG-AA-C—T-G-C-T-C-G–G--TA-A-A-C-T-G-CCA–A-C-C–T-T-G-A-A-C-A–C-GG–TA-G-G—G-GC-C-AT-C—GG–AACT-CTA-G-GT–GG-A-GCG-GTGAAA-TG-CGT-AGAT-A-TC-T-AGA–GG-A-AC-G-CC-AG–A--G–GC-GAA-G–G-C–G--G–A-T-GGCTG----G-GC-CG-----TT-G-T-T–GA-CG----CT-CA-GG–G-A-CGA–AA-G-C—G-TGGG-TAG-C-G-AACG–GG-ATTA-G-ATA-C-CC-C-G-GTA-G-T Bacteria(100) “Planctomycetes”(79) Phycisphaerae(79) Phycisphaerales(79) Phycisphaeraceae(79) Phycisphaera(79)
Otu00426 0 1 0 10 6 1 1 0 0 0 0 0 0 0 IYW8NF002G89C3 GA-AC-CGT-CCA-A-A-CG-TTAT-T-CGG-AA-TC-A–C-T–GGGC-TTA—AA-GGGT-GC-G-TA-G-G-C-G-G–C-CC-T-G-T-AA—G-T-T-G—G-GTG-TG-A-AA-TC–CC-TCG-G–CT-C-AA—CC-G-A-GG-AA-T—T-G-C-G-C-C–C--AA-T-A-C-T-G-CAG–G-G-C–T-A-G-A-G-G-G–A-GA–CA-G-A—G-GT-G-AG-C—GG–AACT-TGT-G-GT–GG-A-GCG-GTGAAA-TG-CGT-TGAT-A-TC-A-CAA–GG-A-AC-A-CC-TG–T--G–GC-GAA-A----G-CG–G--C-T-CACTG----G-GT-CT-----TT-T-C-T–GA-CG----CT-GA-GG–C-A-CGA–AA-G-C—T-GGGG-GAG-C-G-AACG–GG-ATTA-G-ATA-C-CC-C-G-GTA-G-T Bacteria(100) “Planctomycetes”(58) “Planctomycetacia”(58) Planctomycetales(58) Planctomycetaceae(58) unclassified(58)
Otu00447 0 1 0 4 0 3 2 0 0 0 0 2 6 0 IYW8NF002G6DIG GA-AC-CGT-ACG-A-A-CG-TTAT-T-CGG-AA-TC-A–C-T–GGGC-TTA—AA-GAGT-GC-G-TA-G-G-C-G-G–C-TT-G-G-C-AG—G-T-T-G—G-GTG-TG-A-AA-GC–CC-TCG-G–CT-C-AA—CC-G-A-GG-AA-T—T-G-C-G-C-C–C--AA-A-A-C-C-G-CCA–A-G-C–T-T-G-A-G-G-G–A-GA–TA-G-A—G-GT-G-AG-C—GG–AACT-AAT-G-GT–GG-A-GCG-GTGAAA-TG-CGT-TGAT-A-TC-A-TTA–GG-A-AC-A-CC-GG–T--G–GC-GAA-A----G-CG–G--C-T-CACTG----G-GT-CT-----CT-T-C-T–GA-CG----CT-GA-GG–C-A-CGA–AA-G-C—T-AGGG-GAG-C-G-AACG–GG-ATTA-G-ATA-C-CC-C-G-GTA-G-T Bacteria(100)
unclassified(67) unclassified(67) unclassified(67) unclassified(67) unclassified(67)
Otu00561 0 9 0 3 2 0 0 0 0 0 0 0 0 0 IYW8NF002HK82E GG-AG-GGT-GCG-A-G-CG-TTAA-T-CGG-AA-TC-A–C-T–GGGC-GTA—AA-GAGC-GC-G-TA-G-G-T-G-G–T-CT-G-A-T-TA—G-T-C-G—G-ATG-TG-A-AA-GC–CC-TAG-G–CT-C-AA—CC-T-A-GG-AA-C—T-G-C-A-T-T–C--GA-T-A-C-T-G-TCA–G-G-C–T-T-G-A-G-T-A–T-GG–GA-G-A—G-GG-A-AG-C—GG–AATT-CCC-G-GT–GT-A-GCG-GTGAAA-TG-CGT-AGAT-A-TC-G-GGA–GG-A-AC-A-CC-AG–T--G–GC-GAA-G–G-C–G--G–C-T-TCCTG----G-CC-CA-----AT-A-C-T–GA-CA----CT-GA-GG–C-G-CGA–AA-G-C—G-TGGG-GAG-C-A-AACA–GG-ATTA-G-ATA-C-CC-T-G-GTA-G-T Bacteria(100) “Proteobacteria”(79) unclassified(58) unclassified(58) unclassified(58) unclassified(58)


When we first talked about it, you asked me: “What command syntax are you running for classify.seqs and the database command?”

To classify seqs after I remove lineages, I run this:

mothur > classify.seqs(fasta=3sites.shhh.trim.unique.good.filter.unique.precluster.pick.fasta, name=3sites.shhh.trim.unique.good.filter.unique.precluster.pick.names, group=3sites.shhh.good.pick.groups, template=/home/zm1/Mothur.cen/Trainset9_032012.pds/trainset9_032012.pds.fasta, taxonomy=/home/zm1/Mothur.cen/Trainset9_032012.pds/trainset9_032012.pds.tax, cutoff=80)

To create the database:

mothur > get.oturep(list=3sitesall.final.woCyano.an.list, label=0.03, fasta=3sitesall.final.woCyano.fasta, column=3sitesall.final.woCyano.dist, name=3sitesall.final.woCyano.names)
********************###########
Reading matrix: |||||||||||||||||||||||||||||||||||||||||||||||||||


0.03 5860

Output File Names:
3sitesall.final.woCyano.an.0.03.rep.names
3sitesall.final.woCyano.an.0.03.rep.fasta


mothur > classify.otu(list=3sitesall.final.woCyano.an.list, name=3sitesall.final.woCyano.names, taxonomy=3sitesall.final.woCyano.taxonomy, label=0.03)
reftaxonomy is not required, but if given will keep the rankIDs in the summary file static.
0.03 5860

Output File Names:
3sitesall.final.woCyano.an.0.03.cons.taxonomy
3sitesall.final.woCyano.an.0.03.cons.tax.summary


mothur > create.database(list=3sitesall.final.woCyano.an.list, label=0.03, repfasta=3sitesall.final.woCyano.an.0.03.rep.fasta, repname=3sitesall.final.woCyano.an.0.03.rep.names, constaxonomy=3sitesall.final.woCyano.an.0.03.cons.taxonomy, group=3sitesall.final.woCyano.groups)

Output File Names:
3sitesall.final.woCyano.an.database

THANKS!!!

The 80% cutoff refers to the bootstrap confidence scores, which indicate how reliable a classification is. It doesn’t relate to the taxonomic levels (there’s really no distance-based relationship with taxonomic levels).

Pat
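As a rough sketch of the point above, the code below treats the cutoff as a filter on per-level confidence scores: each level of a classification carries its own score, and levels scoring below the cutoff are reported as unclassified, whatever rank they sit at. The function name, the taxonomy string, and its scores are hypothetical, and the trimming rule is an assumption about how a confidence cutoff behaves, not mothur's source code.

```python
import re

# One taxonomy level looks like "Name(score)", e.g. "Bacteria(100)".
LEVEL = re.compile(r"^(?P<name>.*)\((?P<conf>\d+)\)$")


def apply_cutoff(taxonomy, cutoff=80):
    """Replace every level from the first one scoring below cutoff with 'unclassified'."""
    kept = []
    below = False
    for field in taxonomy.strip(";").split(";"):
        match = LEVEL.match(field.strip())
        if match is None:
            raise ValueError(f"cannot parse taxonomy level: {field!r}")
        name, conf = match["name"], int(match["conf"])
        if below or conf < cutoff:
            below = True  # deeper levels cannot be trusted either
            kept.append("unclassified")
        else:
            kept.append(f"{name}({conf})")
    return ";".join(kept)


# Hypothetical classification with per-level confidence scores.
example = "Bacteria(100);Proteobacteria(95);Gammaproteobacteria(82);Enterobacteriales(61);Enterobacteriaceae(60)"
trimmed = apply_cutoff(example)
print(trimmed)
```

Under this reading, a sequence can be confidently placed only to class while another is placed to genus; the cutoff says nothing about which rank a classification reaches.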