Preparing files for picrust with mothur

Dear mothur enthusiasts,

A colleague advised me to try and run picrust on my dataset in order to get some functional inference from 16S sequences. He told me to “classify the OTU on GreenGenes” and prepare a biom file.

I have tried to follow the commands listed in an earlier forum post (Mothur file to biom), continuing from the fasta and count_table from the “classical” analysis (mothur 1.48 under Windows 10 Pro):

classify.seqs(fasta=16Sn_final.fasta, count=16Sn_final.count_table, reference=gg_13_5_99.fasta, taxonomy=gg_13_5_99.gg.tax, cutoff=80)

It took 5 secs to create the summary file for 47073 sequences.

[…] [WARNING]: M02352_43_000000000-J98TB_1_2116_20022_7314 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences. […]

Output File Names:
16Sn_final.gg.wang.taxonomy
16Sn_final.gg.wang.tax.summary

remove.lineage(fasta=16Sn_final.fasta, count=16Sn_final.count_table, taxonomy=16Sn_final.gg.wang.taxonomy, taxon=unknown)

Removed 3 sequences from 16Sn_final.fasta.
Removed 4 sequences from 16Sn_final.count_table.

Output File Names:
16Sn_final.pick.fasta
16Sn_final.pick.count_table

/******************************************/

Output File Names:
16Sn_final.gg.wang.pick.taxonomy
16Sn_final.gg.wang.accnos
16Sn_final.pick.count_table
16Sn_final.pick.fasta

dist.seqs(fasta=16Sn_final.pick.fasta, cutoff=0.03)

Output File Names:
16Sn_final.pick.dist

cluster(column=16Sn_final.pick.dist, count=16Sn_final.pick.count_table)

Output File Names:
16Sn_final.pick.opti_mcc.list
16Sn_final.pick.opti_mcc.steps
16Sn_final.pick.opti_mcc.sensspec

make.shared(list=16Sn_final.pick.opti_mcc.list, count=16Sn_final.pick.count_table)

Output File Names:
16Sn_final.pick.opti_mcc.shared

classify.otu(list=16Sn_final.pick.opti_mcc.list, count=16Sn_final.pick.count_table, taxonomy=16Sn_final.gg.wang.pick.taxonomy)

Output File Names:
16Sn_final.pick.opti_mcc.0.03.cons.taxonomy
16Sn_final.pick.opti_mcc.0.03.cons.tax.summary

make.biom(shared=16Sn_final.pick.opti_mcc.shared, constaxonomy=16Sn_final.pick.opti_mcc.0.03.cons.taxonomy, reftaxonomy=gg_13_5_99.gg.tax, picrust=97_otu_map.txt, label=0.03)

Output File Names:
16Sn_final.pick.opti_mcc.0.03.biom
16Sn_final.pick.opti_mcc.0.03.biom_shared

The forum post mentioned above lists more steps to be performed on the biom file but I have no idea how to do this:

biom validate-table -i crc.0.03.biom…

So, I have uploaded both files from make.biom(…) onto the Huttenhower lab GALAXY instance (and the files appear in green) but it fails at the first step, Normalize By Copy Number.

Dataset generation errors

Dataset 4: Normalize By Copy Number on data 3
Tool execution generated the following error message:

Traceback (most recent call last):
File “/galaxy-central/tools/picrust/scripts/normalize_by_copy_number.py”, line 146, in
main()
File “/galaxy-central/tools/picrust/scripts/normalize_by_copy_number.py”, line 72, in main
otu_table = parse_biom_table(open(opts.input_otu_fp,‘U’))
File “/galaxy_venv/local/lib/python2.7/site-packages/biom/parse.py”, line 323, in parse_biom_table
t = parse_biom_table_str(table_str, constructor=constructor)
File “/galaxy_venv/local/lib/python2.7/site-packages/biom/parse.py”, line 619, in parse_biom_table_str
raise BiomParseException, ‘Unknown table type’
biom.exception.BiomParseException: Unknown table type

Is this due to the fact that I have not performed the additional steps with biom validate…? If they are necessary, how can I do this? Unfortunately, I have visited the biom site on github and could not find any installer.

Thank you for any help!

Yours,
Maxime

Morning!

I suggest to moove to picrust2 first. What I do is to go the ASV way in mothur (but just for picrust). Then I do this to my files in R studio. Then I run picrust2 on that.

library(phylotools)

setwd(“E:/OneDrive - Universite de Montreal/sequences microbiote/campy_fort_faible/picrust2”)

fastatableseqname ← read.fasta(file = “finalng.fasta”, clean_name = FALSE)

fastatableseqname
colnames(fastatableseqname)
fastatableseqname = subset(fastatableseqname, select = c(seq.name,seq.text))
colnames(fastatableseqname)

OTUnumbseqname <-read.table(“finalASV.otu”, header=FALSE, col.names=c(“OTU”, “seq.name”))
colnames(OTUnumbseqname)

mergetable <-merge(fastatableseqname, OTUnumbseqname, by=“seq.name”, all=FALSE)

mergetable
colnames(mergetable)

sharedtable <-read.table(“final.shared”, header=TRUE)
colnames(sharedtable)
rownames(sharedtable)<-sharedtable$Group
rownames(sharedtable)

sharedtablesimple = subset(sharedtable, select = -c(label,Group, numOtus))
colnames(sharedtablesimple)
rownames(sharedtablesimple)

tshared ← as.data.frame(t(sharedtablesimple))
colnames(tshared)
rownames(tshared)

tshared$OTU ← rownames(tshared)

piecrustinput <-merge(mergetable, tshared, by=“OTU”, all=FALSE)

piecrustinput2 <-subset(piecrustinput, select = -c(seq.text,OTU))

colnames(piecrustinput2)
rownames(piecrustinput2)

write.table(piecrustinput2, “piecrustinput2.txt”, sep=“\t”, col.names=TRUE, row.names=FALSE)

Merci! I will try the method you suggest. It will likely take me some time because I am not familiar with R at al…

Aucun problème.

I just saw that there was a file that I did not mentioned how to do. Here are my Mothur command to get all files needed for the analysis.

As for R, well I guess you will need to learn it at least to analyse the output of picrust2.

set.logfile(name=porc_tout_picrust_clustersplit)

set.current(processors=32)

set.seed(seed=100)
make.contigs(file=porc_tout_picrust.files, oligos=primers.oligo.txt, checkorient=t, pdiffs=2, deltaq=5, maxambig=0, maxlength=300, maxhomop=20, processors=32)

unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current, count=current)

screen.seqs(fasta=current, count=current, summary=current, maxambig=0, maxhomop=20)

align.seqs(fasta=current, reference=silva.nr_v132.pcr.align, flip=t, processors=32)
summary.seqs(fasta=current, count=current)
count.groups(count=current)

screen.seqs(fasta=current, count=current, summary=current, start=1968, end=11550, maxambig=0, maxhomop=20)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

set.seed(seed=100)
pre.cluster(fasta=current, count=current, diffs=4, processors=32, method=unoise)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

chimera.vsearch(fasta=current, count=current, dereplicate=t, processors=32)
count.groups(count=current)
summary.seqs(fasta=current, count=current)

set.seed(seed=100)
classify.seqs(fasta=current, count=current, iters=1000, reference=silva.nr_v138.align, taxonomy=silva.nr_v138.tax, cutoff=75)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)
summary.tax(taxonomy=current, count=current)
summary.seqs(fasta=current, count=current)
count.groups(count=current)

set.seed(seed=100)
cluster(fasta=current, count=current, iters=1000, precision=1000, delta=0, method=unique, processors=32)

make.shared(list=current,count=current,label=asv)

classify.otu(list=current, count=current, taxonomy=current, threshold=75)

degap.seqs(fasta=current)
get.otulist(list=current, label=ASV)

quit()

By the way, I only use ASV route for the downstream picrust analysis. I get slightly better mock community, at least at the genus level, using the OTU route. I started using ASV because I am not able to get the representative sequence out of my OTUs, the run always crash, even on remote server.

Best of success

Bonjour Alexandre,
Hello to all friends of mothur,

I apologize in advance for all those newbie questions… Am I right in my understanding that I have to:

(1) follow the mothur workflow that you provided (“the ASV route”), producing the three files finalng.fasta, finalASV.otu and final.shared,

(2) use those three files as input for the provided R script, producing the file piecrustinput2.txt in your example,

(3) use that file as input for PICRUSt (possible online) / PICRUSt2 (on the command-line),

(4) back to R, analyse the output of PICRUSt / PICRUSt2?

In parallel to this discussion, I have received direct help from Gavin Douglas (one of the authors of PICRUSt2). I sent my biom file to Gavin (the one I had produced with mothur with the aforementioned commands) and he converted it to the right format (with biom-format, I think just as explained by Pat in another forum thread).

Unfortunately, this converted biom file didn’t work either on the PICRUSt instance I tried (version 1 on GALAXY). I have not been able to setup biom-format on my computer (install of Python seems to succeed but pip install biom-format does not). I don’t have access to a Linux workstation right now and thus I don’t have a possibility to try PICRUSt2 either (if I ever manage to install it properly…).

So it’ll take me quite some efforts after the year-end’s break!

Yours,
Maxime

Hello, this is what I am doing to get to picrust2 analysis. Never worked with biom file, I am using in R microbiomemarker to import the files into phyloseq. But you need first to manipulate the files to get to the correct format.

I will tidy this up as soon as I can to provide a real pipeline analysis.