aggregating data

Hi,

I am new to analyzing microbiome data, so this might be a bit basic…

I aligned my sequences using Silva. This is what I get:

=========
sample_4 sample_5 sample_6 sample_7 sample_11
Actinobacteria 159 714 769 910 387
Actinobacteria(85) 0 0 0 0 0
Bacteria_unclassified 0 49 24 29 202
Bacteria_unclassified(90) 0 0 0 0 0
Bacteria_unclassified(95) 0 0 0 0 0
.
.

Is there a way that I can collapse/aggregate this further so that all Actinobacteria are in one row, all Bacteria in the second row etc.

many thanks!

You can use the phylotype command to group sequences at a specified taxonomic rank (phylum, class, …, genus).

I used the following commands:

mothur > phylotype(taxonomy=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy)

However, now I only get ~ 180 OTUs! Previously, I was getting about 2,200 although many of them were ‘_unclassified’. Actually, out of 2216 OTUs, 460 were “_unclassified” at the famil/genus level. Is this high?

Am I still doing something wrong here?

Hi PCA,

Your output doesn’t add up to me - it doesn’t look like mothur output. Can you post the actual mothur commands you are running?

Pat

Hi Pat,

A bit long :frowning: but here it is:

####***********

  1. Combine reads:

mothur > make.contigs(file=jpk.stability.files,processors=8)

97 522
98 57349
99 12908

Total of all groups is 4961319

Output File Names:
jpk.stability.trim.contigs.fasta
jpk.stability.trim.contigs.qual
jpk.stability.contigs.report
jpk.stability.scrap.contigs.fasta
jpk.stability.scrap.contigs.qual
jpk.stability.contigs.groups

[WARNING]: your sequence names contained ‘:’. I changed them to ‘_’ to avoid problems in your downstream analysis.

####***********
2. Look at summary stats

mothur > summary.seqs(fasta=jpk.stability.trim.contigs.fasta)

.
.
Using 8 processors.
summary.seqs(fasta=jpk.stability.trim.contigs.fasta)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 244 244 0 2 1
2.5%-tile: 1 252 252 0 4 124033
25%-tile: 1 253 253 0 4 1240330
Median: 1 253 253 0 4 2480660
75%-tile: 1 253 253 0 5 3720990
97.5%-tile: 1 253 253 9 6 4837287
Maximum: 1 502 502 97 251 4961319
Mean: 1 253.055 253.055 0.900051 4.48268

of Seqs: 4961319

Output File Names:
jpk.stability.trim.contigs.summary

It took 55 secs to summarize 4961319 sequences.


####*********** 3. mothur > screen.seqs(fasta=jpk.stability.trim.contigs.fasta, group=jpk.stability.contigs.groups, maxambig=0, maxlength=275)

.
.
Processing sequence: 620000
Processing sequence: 620100
Processing sequence: 620128

Output File Names:
jpk.stability.trim.contigs.good.fasta
jpk.stability.trim.contigs.bad.accnos
jpk.stability.contigs.good.groups


It took 66 secs to screen 4961319 sequences.

####***********
4.

mothur > unique.seqs(fasta=jpk.stability.trim.contigs.good.fasta)

.
.
3791000 257984
3792000 258022
3792821 258046

Output File Names:
jpk.stability.trim.contigs.good.names
jpk.stability.trim.contigs.good.unique.fasta


####*********** 5.

mothur > get.current()


####*********** 6. mothur > count.seqs(name=jpk.stability.trim.contigs.good.names, group=jpk.stability.contigs.good.groups)

Using 8 processors.
It took 19 secs to create a table for 3792821 sequences.


Total number of sequences: 3792821

Output File Names:
jpk.stability.trim.contigs.good.count_table


####***********
  1. summary.seqs(count=jpk.stability.trim.contigs.good.count_table)

Using jpk.stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 8 processors.



Start End NBases Ambigs Polymer NumSeqs Minimum: 1 250 250 0 3 1 2.5%-tile: 1 252 252 0 4 94821 25%-tile: 1 253 253 0 4 948206 Median: 1 253 253 0 4 1896411 75%-tile: 1 253 253 0 5 2844616 97.5%-tile: 1 253 253 0 6 3698001 Maximum: 1 273 273 0 9 3792821 Mean: 1 252.8 252.8 0 4.48627 # of unique seqs: 258046 total # of seqs: 3792821

Output File Names:
jpk.stability.trim.contigs.good.unique.summary

It took 4 secs to summarize 3792821 sequences.

####***********
8.


mothur > pcr.seqs(fasta=/Users/pca/Documents/microbiome/jpk/silva.bacteria/silva.bacteria.fasta, start=11894, end=25319, keepdots=F, processors=8)

.
.
Processing sequence: 1870
Processing sequence: 1869

Output File Names:
/Users/pca/Documents/microbiome/jpk/silva.bacteria/silva.bacteria.pcr.fasta


It took 27 secs to screen 14956 sequences.
####*********** 9.

mothur > system(cp /Users/pca/Documents/microbiome/jpk/silva.bacteria/silva.bacteria.pcr.fasta /Users/pca/Documents/microbiome/jpk/silva.bacteria/silva.v4.fasta)


mothur > summary.seqs(fasta=/Users/pca/Documents/microbiome/jpk/silva.bacteria/silva.v4.fasta)

.
.
Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13424 270 0 3 1
2.5%-tile: 1 13425 292 0 4 374
25%-tile: 1 13425 293 0 4 3740
Median: 1 13425 293 0 4 7479
75%-tile: 1 13425 293 0 5 11218
97.5%-tile: 1 13425 294 1 6 14583
Maximum: 3 13425 351 5 9 14956
Mean: 1.00074 13425 292.977 0.0573014 4.57014

of Seqs: 14956

Output File Names:
/Users/pca/Documents/microbiome/jpk/silva.bacteria/silva.v4.summary

It took 1 secs to summarize 14956 sequences.

####***********
10.

mothur > align.seqs(fasta=jpk.stability.trim.contigs.good.unique.fasta, reference=/Users/pca/Documents/microbiome/jpk/silva.bacteria/silva.v4.fasta)
.
.
32256
32264
32256
[WARNING]: Some of your sequences generated alignments that eliminated too many bases, a list is provided in jpk.stability.trim.contigs.good.unique.flip.accnos. If you set the flip parameter to true mothur will try aligning the reverse compliment as well.
It took 192 secs to align 258046 sequences.


Output File Names: jpk.stability.trim.contigs.good.unique.align jpk.stability.trim.contigs.good.unique.align.report jpk.stability.trim.contigs.good.unique.flip.accnos
####*********** 11.

summary.seqs(fasta=jpk.stability.trim.contigs.good.unique.align, count=jpk.stability.trim.contigs.good.count_table)


Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 1 1 0 1 1
2.5%-tile: 1968 11550 252 0 4 94821
25%-tile: 1968 11550 253 0 4 948206
Median: 1968 11550 253 0 4 1896411
75%-tile: 1968 11550 253 0 5 2844616
97.5%-tile: 1968 11550 253 0 6 3698001
Maximum: 13425 13425 273 0 9 3792821
Mean: 1968.01 11550 252.799 0 4.48626

of unique seqs: 258046

total # of seqs: 3792821

Output File Names:
jpk.stability.trim.contigs.good.unique.summary

It took 68 secs to summarize 3792821 sequences.



####*********** 12.

screen.seqs(fasta=jpk.stability.trim.contigs.good.unique.align, count=jpk.stability.trim.contigs.good.count_table, summary=jpk.stability.trim.contigs.good.unique.summary, start=1968, end=11550, maxhomop=8)

.
.
Processing sequence: 32256
Processing sequence: 32200
Processing sequence: 32256

Output File Names:
jpk.stability.trim.contigs.good.unique.good.summary
jpk.stability.trim.contigs.good.unique.good.align
jpk.stability.trim.contigs.good.unique.bad.accnos
jpk.stability.trim.contigs.good.good.count_table


It took 183 secs to screen 258046 sequences.

###***
13. mothur > summary.seqs(fasta=current, count=current)

Using jpk.stability.trim.contigs.good.good.count_table as input file for the count parameter.
Using jpk.stability.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1233 11550 250 0 3 1
2.5%-tile: 1968 11550 252 0 4 94782
25%-tile: 1968 11550 253 0 4 947816
Median: 1968 11550 253 0 4 1895632
75%-tile: 1968 11550 253 0 5 2843448
97.5%-tile: 1968 11550 253 0 6 3696482
Maximum: 1968 13424 273 0 8 3791263
Mean: 1968 11550 252.8 0 4.48624

of unique seqs: 257356

total # of seqs: 3791263

Output File Names:
jpk.stability.trim.contigs.good.unique.good.summary

It took 65 secs to summarize 3791263 sequences.



####*********** 14. mothur > filter.seqs(fasta=jpk.stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
. . 31900 32000 32100 32170

Length of filtered alignment: 501 Number of columns removed: 12924 Length of the original alignment: 13425 Number of sequences used to construct filter: 257356

Output File Names:
jpk.filter
jpk.stability.trim.contigs.good.unique.good.filter.fasta

This means that our initial alignment was 13425 columns wide and that we were able to remove 12924 terminal gap characters using trump=. and vertical gap characters using vertical=T. The final alignment length is 501 columns. Because we’ve perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs:

####***********
15.
mothur > unique.seqs(fasta=jpk.stability.trim.contigs.good.unique.good.filter.fasta, count=jpk.stability.trim.contigs.good.good.count_table)

.
.
256000 255807
257000 256806
257356 257162

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.count_table
jpk.stability.trim.contigs.good.unique.good.filter.unique.fasta


####*********** 16. mothur > pre.cluster(fasta=jpk.stability.trim.contigs.good.unique.good.filter.unique.fasta, count=jpk.stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
. . jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.97.map jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.98.map jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.99.map

####*********** 17.

mothur > chimera.uchime(fasta=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

.
.
It took 12 secs to check 1144 sequences from group 74.
uchime v4.2.40
by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.

00:00 835.6 100.0% Reading jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.temp5588.temp
00:00 847.9 302 sequences
00:01 1.6Mb 100.0% 85/301 chimeras found (28.1%)

It took 1 secs to check 302 sequences from group 75.

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.chimeras
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos


####*********** 18.

mothur > remove.seqs(fasta=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)


[WARNING]: This command can take a namefile and you did not provide one. The current namefile is jpk.stability.trim.contigs.good.names which seems to match jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta. Removed 32751 sequences from your fasta file.

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta


####*********** 19. mother > summary.seqs(fasta=current, count=current)

Using jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table as input file for the count parameter.
Using jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta as input file for the fasta parameter.

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 501 241 0 3 1
2.5%-tile: 1 501 252 0 4 89009
25%-tile: 1 501 253 0 4 890086
Median: 1 501 253 0 4 1780172
75%-tile: 1 501 253 0 5 2670258
97.5%-tile: 1 501 253 0 6 3471335
Maximum: 1 501 258 0 8 3560343
Mean: 1 501 252.802 0 4.47901

of unique seqs: 38520

total # of seqs: 3560343

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.summary

It took 0 secs to summarize 3560343 sequences.


####*********** 20.

mothur > classify.seqs(fasta=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=/Users/pca/Documents/microbiome/jpk/RDP_ref/trainset14_032015.pds/trainset14_032015.pds.fasta, taxonomy=/Users/pca/Documents/microbiome/jpk/RDP_ref/trainset14_032015.pds/trainset14_032015.pds.tax, cutoff=80)


. . Processing sequence: 38400 Processing sequence: 38500 Processing sequence: 38520

It took 306 secs to classify 38520 sequences.


It took 3 secs to create the summary file for 38520 sequences.
Output File Names: jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.tax.summary

####***********
21.

Now that everything is classified we want to remove our undesirables. We do this with the remove.lineage command:

mothur > remove.lineage(fasta=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.


Output File Names: jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table

####*********** 22.
mothur > summary.seqs(fasta=current, count=current)
Using jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table as input file for the count parameter. Using jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta as input file for the fasta parameter.

Using 1 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 501 241 0 3 1
2.5%-tile: 1 501 252 0 4 88907
25%-tile: 1 501 253 0 4 889066
Median: 1 501 253 0 4 1778132
75%-tile: 1 501 253 0 5 2667197
97.5%-tile: 1 501 253 0 6 3467356
Maximum: 1 501 258 0 8 3556262
Mean: 1 501 252.8 0 4.47842

of unique seqs: 38423

total # of seqs: 3556262

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.summary

It took 1 secs to summarize 3556262 sequences.



####*********** 23.

mothur > dist.seqs(fasta=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20)

.
.
38300 2680
38400 2691
38422 2695

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist

It took 2695 seconds to calculate the distances for 38423 sequences.

cluster(column=stinki_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=stinki_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table)

######**********
24.

mothur > cluster(column=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)

.
.
0 0 0 0 1
changed cutoff to 0.075503
It took 17100 seconds to cluster

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list


###************ 25.

mothur > make.shared(list=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=0.03)

.
.
0.03

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.shared



###**** OTUS 26.

We probably also want to know the taxonomy for each of our OTUs. We can get the consensus taxonomy for each OTU using the classify.otu command:

mothur > classify.otu(list=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, label=0.03)

0.03 2216

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.tax.summary


####***** PHYLOTYPES 27.
mothur > phylotype(taxonomy=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy)
1 2 3 4 5 6

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.sabund
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.rabund
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.list


######********* 28. mothur > make.shared(list=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.list, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, label=1)

1

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.shared



We also want to know who these OTUs are and can run classify.otu on our phylotypes:

mothur > classify.otu(list=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.list, count=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table, taxonomy=jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy, label=1)

.
.
1 185

Output File Names:
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.1.cons.taxonomy
jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.tx.1.cons.tax.summary

What do the first lines of jpk.stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy look like?

I’m still not sure where you were getting the first table you had in your first post. If you sort the output of the *wang.pick.taxonomy file by the taxlevel column, I think you’ll get what you want. I’m concerned, because you seem to have the confidence scores with the taxa names, which is not correct.

Pat