Issues subsampling data

we are having a problem obtaining representative sequences which can be blasted in order to check unclassified OTU taxonomic classifications. we have been using the following commands to process our data and obtain our OTU table

#sffinfo(sff=101712KM28F.sff, flow=T)
#trim.flows(flow=101712KM28F.flow, oligos=am_reads.oligos, pdffs=2, bdiffs=1, processors=10) #shhh.flows(file=101712KM28F.flow.foles, processors=11) summary.seqs(fasta=101712KM28F.shhh.fasta, name=101712KM28F.shhh.names) unique.seqs(fasta=101712KM28F.shhh.fasta, name=101712KM28F.shhh.names) summary.seqs(fasta=101712KM28F.shhh.unique.fasta, name=101712KM28F.shhh.unique.names)
align.seqs(fasta=101712KM28F.shhh.unique.fasta, reference=silva.bacteria.fasta, processors=4, flip=t) summary.seqs(fasta=101712KM28F.shhh.unique.align, name=101712KM28F.shhh.unique.names)
merge.files(input=101712KM28F.AMB1.28f.shhh.groups-101712KM28F.AMB10.28f.shhh.groups-101712KM28F.AMB11.28f.shhh.groups-101712KM28F.AMB12.28f.shhh.groups-101712KM28F.AMB14.28f.shhh.groups-101712KM28F.AMB15.28f.shhh.groups-101712KM28F.AMB5.28f.shhh.groups-101712KM28F.AMB6.28f.shhh.groups-101712KM28F.AMB7.28f.shhh.groups-101712KM28F.AMB9.28f.shhh.groups-101712KM28F.AMC10.28f.shhh.groups-101712KM28F.AMC11.28f.shhh.groups-101712KM28F.AMC12.28f.shhh.groups-101712KM28F.AMC13.28f.shhh.groups-101712KM28F.AMC15.28f.shhh.groups-101712KM28F.AMC3.28f.shhh.groups-101712KM28F.AMC5.28f.shhh.groups-101712KM28F.AMC6.28f.shhh.groups-101712KM28F.AMC8.28f.shhh.groups-101712KM28F.AMC9.28f.shhh.groups, output=101712KM28F.shhh.groups) screen.seqs(fasta=101712KM28F.shhh.unique.align, name=101712KM28F.shhh.unique.names, group=101712KM28F.shhh.groups, start=1044, optimize=end, criteria=95, processors=4) filter.seqs(fasta=101712KM28F.shhh.unique.good.align, vertical=T, trump=., processors=4) summary.seqs(fasta=101712KM28F.shhh.unique.good.filter.fasta, name=101712KM28F.shhh.unique.good.names)
unique.seqs(fasta=101712KM28F.shhh.unique.good.filter.fasta, name=101712KM28F.shhh.unique.good.names)
pre.cluster(fasta=101712KM28F.shhh.unique.good.filter.unique.fasta, name=101712KM28F.shhh.unique.good.filter.names, group=101712KM28F.shhh.good.groups, diffs=2) chimera.uchime(fasta=101712KM28F.shhh.unique.good.filter.unique.precluster.fasta, name=101712KM28F.shhh.unique.good.filter.unique.precluster.names, group=101712KM28F.shhh.good.groups, processors=4) remove.seqs(accnos=101712KM28F.shhh.unique.good.filter.unique.precluster.uchime.accnos, fasta=101712KM28F.shhh.unique.good.filter.unique.precluster.fasta, name=101712KM28F.shhh.unique.good.filter.unique.precluster.names, group=101712KM28F.shhh.good.groups, dups=T)
classify.seqs(fasta=101712KM28F.shhh.unique.good.filter.unique.precluster.pick.fasta, name=101712KM28F.shhh.unique.good.filter.unique.precluster.pick.names, group=101712KM28F.shhh.good.pick.groups, template=97_otus.fasta, taxonomy=97_otu_taxonomy_format.txt, cutoff=80, processors=4) remove.lineage(fasta=101712KM28F.shhh.unique.good.filter.unique.precluster.pick.fasta, name=101712KM28F.shhh.unique.good.filter.unique.precluster.pick.names,, taxon=Mitochondria-Eukaryota-Chloroplast-unknown, group=101712KM28F.shhh.good.pick.groups)
summary.seqs(fasta=101712KM28F.shhh.unique.good.filter.unique.precluster.pick.pick.fasta, name=101712KM28F.shhh.unique.good.filter.unique.precluster.pick.pick.names)
system(cp 101712KM28F.shhh.unique.good.filter.unique.precluster.pick.pick.fasta am_final.fasta)
system(cp 101712KM28F.shhh.unique.good.filter.unique.precluster.pick.pick.names am_final.names)
system(cp 101712KM28F.shhh.good.pick.pick.groups am_final.groups)
system(cp am_final.taxonomy)
dist.seqs(fasta=am_final.fasta, cutoff=0.15, processors=2)
cluster(column=am_final.dist, name=am_final.names)
make.shared(, group=am_final.groups, label=0.03)
sub.sample(, name=am_final.names, size=1426)
classify.otu(, name=am_final.names, taxonomy=am_final.taxonomy, label=0.03)
dist.seqs(fasta=am_final.fasta, output=phylip, processors=2)
get.oturep(phylip=am_final.phylip.dist, fasta=am_final.fasta, name=am_final.names,, group=am_final.groups)
system(…/…/…/…/programs/pprospector-1.0.1/scripts/ -f

the output from the subsample and the get.otu rep seems to me to not correspond to each other when we blast the output from get.oturep our results don't correspond to the assigned taxonomy in the am_final.taxonomy file or the sub.sample shared file ( we aren't doubting that the assignments in mother are correct, we just cant seem to figure out a way of effectively subsampling our data, then getting representative sequences for each out that correspond to each other. so many of our sequences have little to no taxonomic classifications, we need to look at some individual blast results to even get a feel for what we are dealing with

thank you from two very frustrated users
Patrick and Kathy

sub.sample(, name=am_final.names, size=1426) 
classify.otu(, name=am_final.names, taxonomy=am_final.taxonomy, label=0.03) 
dist.seqs(fasta=am_final.fasta, output=phylip, processors=2) 
get.oturep(phylip=am_final.phylip.dist, fasta=am_final.fasta, name=am_final.names,, group=am_final.groups)

So you are only subsampling your shared and names file and your following files need a list file, name file, taxonomy file, fasta file and group file. All of those should be included in the sub.sample command and the subsampled files should be used in the subsequent commands.

Thanks for your quick response Pat,
one thing that confuses me regarding this command is that if I declare the list file and the fasta file in the command, I get an error stating a new group or count file can only be made from the subsample of a listfile or fasta file, not both. please correct.

which should I be using in the sub.sample command and then what alternate fasta or list file should I be using downstream,
eg if I declare fasta in the subsample command, what list file should I be using in classify OTU and Get OTUrep?
alternatively if I declare the listfile in the subsample command what is the correct fasta file that I should be using in the dist.seqs and get.oturep command,
or am I doing something wrong using these commands downstream?

You can sample the list file and then use the get.seqs command to select the subsampled names from the other files.

sub.sample(list=yourListFile, size=1426) - subsample from the list file that contains all the sequences
list.seqs(list=current) - list seqs in subsample
get.seqs(fasta=yourFastaFile, name=yourNameFile, group=yourGroupFile, taxonomy=yourTaxonomy, accnos=current) - select subsampled seqs from all files
dist.seqs(fasta=current, output=phylip) - get distance matrix for subsampled seqs
get.oturep(phylip=current, fasta=current, list=current, group=current, name=current) - otu reps for subsample
classify.otu(list=current, name=current, taxonomy=current) - classifications for subsampled otus.

hi, sorry to bug you again
i have tried your suggestion, but in order to get the shared file i have had to also add a make.shared call


however, i run into another problem. making my final classify.otu call there seems to be an issue with my taxonomy file. there are several warnings printed stating XXXXXXXXXXXX is not in your taxonomy file, i will not include it in the consensus. this becomes apparent when loooking at the resulting taxonomy file that shows many of the reads have no corresponding taxonomy value (i would say over 80% of these in the taxonomy file are called Unclassified

as an example

OTU Size Taxonomy
Otu001 215 Bacteria(100);Proteobacteria(100);Gammaproteobacteria(100);Oceanospirillales(100);Endozoicimonaceae(100);unclassified(100);unclassified(100);
Otu002 0 unclassified;
Otu003 0 unclassified;
Otu004 0 unclassified;
Otu005 38 Bacteria(100);Proteobacteria(100);Gammaproteobacteria(100);Oceanospirillales(100);Endozoicimonaceae(100);unclassified(100);unclassified(100);
Otu006 0 unclassified;
Otu007 0 unclassified;
Otu008 0 unclassified;
Otu009 0 unclassified;
Otu010 0 unclassified;
Otu011 0 unclassified;
Otu012 0 unclassified;
Otu013 0 unclassified;
Otu014 0 unclassified;
Otu015 0 unclassified;

as i understand it, this taxonomy file we used in the call is the result of subsampling, so i thought the previous taxonomy file might work, as it should contain more (dare i say all) the taxonomy entries corresponding to the original fasta file/names file

but apparently not, because even more of these warnings are produced and no additional taxonomic classifications are made in this resulting cons.taxonomy file.

sorry to hassle you again, but i have tried all the options i can think of and to no avail

I think the issue is because the wrong name file is being used. Instead of current, you want am_final.pick.names. Can you try this:

classify.otu(list=am_final.fn.subsample.pick.list, name=am_final.pick.names, taxonomy=am_final.pick.taxonomy)

Had teh same issue with needing to subsample list,names and fasta files all.
went with the solution provided here.
sub.sample(, size=27808)
get.seqs(,,,, accnos=current) - select subsampled seqs from all files
split.abund(,,,, cutoff=10, label=0.05)

All nice and well but …

make.shared(,, label=0.05)
[ERROR]: DBNW5DQ1_84_B04B8ABXX_5_1101_10000_118110_1_N_0__76bp_76.0_0.84_I10 is in your groupfile and not your listfile. Please correct.
[ERROR]: DBNW5DQ1_84_B04B8ABXX_5_1101_10000_123534_1_N_0__77bp_77.0_0.90_I17 is in your groupfile and not your listfile. Please correct.
[ERROR]: DBNW5DQ1_84_B04B8ABXX_5_1101_10000_15067_1_N_0__75bp_75.0_0.84_I14 is in your groupfile and not your listfile. Please correct.

So what’s the deal. Shouldn’t the list and group and all those files be now subsampled in a same way? And if not how can I make it so?

You included the names file, and the get.seqs command has a dups parameter. By default, dups=t. When dups=t, if any sequences in the line in the names file is selected then all the sequences in that line are selected. The get.seqs command was selecting a lot more sequences than were in your list file.

get.seqs(,,,, accnos=current, dups=f) - select subsampled seqs from all files

Note: If you include the group file on the subsample command mothur will subsample it for you and the files will match.

Thank you missed the dups part. It was very helpful.

Hmm ok something I do not understand.
I was thinking to stop messing with the list file etc. and do the subsample on shared file. Anyway I was in the need to both remove low abundance otus and subsample the data.
split.abund(,,,, cutoff=10)
make.shared(,, label=0.05)

The last command will subsample the shared file but otherwise it’ll produce error.
[ERROR]: your fasta file contains 4340185 sequences, and your groupfile contains 4336041, please correct.

Tried both split.abund and sub.sample with and without names parameter, but no matter what combination these files seem to be out of sync.
And they do get out of sync in the split.abund stage -
because (from deunique.seqs(, has 4340185 sequences and has also 4340185 sequences.

The problem is your names and taxonomy files in the subsample do not match your groups and fasta files. Try this:

split.abund(,,,, cutoff=10)
make.shared(,, label=0.05)
get.seqs(dups=f,, accnos=current) - select abundant names
get.seqs(, accnos=current) - select abundant names

Hi Sarah (or Pat)

I posted my problem in other threads and then found this one and thought it will may be solve the problem I had with the group file from split.abund and the phylogeny approach for betadiversity.

I tried the list of commands suggested but again have a problem with the group file. Could you please tell me what I am doing wrong? Below the steps and outputs after first running dist.seqs and cluster with my final files and some questions in between about each step:

split.abund(, fasta=final4.fasta, name=final4.names, group=final4.groups, cutoff=3)
#will remove seqs or otus? what the cutoff does here?

make.shared(, group=final4.0.03.abund.groups, label=0.03) #Make a shared file with only the abund OTUs for all samples. The output file has 4106 OTUs with 4 or more seqs.
list.seqs(group=final4.0.03.abund.groups) #Gave an accnos file listing 184288 sequences. The "unique" seqs in the abund.fasta file are 38482. What is listed here??
get.seqs(dups=f, name=final4.names, accnos=final4.0.03.abund.accnos) #selects the names of the unique abund sequences? The wiki says: The dups parameter is only be used in tandem with a namefile. By default dups is true, so if any sequence in a specific line in the names file is in your .accnos file, then all sequences in that line will be kept. This is especially useful when used with the groupfile, since for most commands your files can contain only the unique sequences, but the groupfile need to contain all the sequences in your namefile. I'm afraid I don't get the point here, as the split.abund removes the seqs that appear 1, 2 or 3 times (cutoff=3), how is that possible that one of these seqs belong to a "cluster" or more seqs so being needed to use dups=f?. Why not to continue working with the abund fasta, list and group files?? Is all this to have an "abund" names file that is not made by the split.abund command?)

#Selects the name of the 38482 seqs in the abund fasta file?

get.seqs(taxonomy=final4.taxonomy, names=final4.pick.names, accnos=final4.0.03.abund.accnos) #Selected 38482 sequences from my taxonomy file. #Now the taxonomy file have the classification of the abundant seqs only?
sub.sample(, fasta=final4.0.03.abund.fasta, name=final4.pick.names, group=final4.0.03.abund.groups, taxonomy=final4.pick.taxonomy)

Sampling 5278 from each group.
Sampling 5278 from 184288.
Deconvoluting subsampled fasta file…
Running command: unique.seqs(fasta=final4.0.03.abund.subsample.fasta)
1000 245
2000 675
3000 1223
4000 2000
5000 2884
5278 3074

Output File Names:

Output File Names: #(with 4105 OTUs and 142506 seqs)
final4.pick.subsample.names #(with the names of the seqs represented by each of the 3074 unique seqs)
final4.pick.subsample.taxonomy #(with the classification of 3074 seqs)
final4.0.03.abund.subsample.fasta #(with 3074 seqs)
final4.0.03.abund.subsample.groups #(5278 seqs in total, NOT 5278 for each sample!)

10A_11 contains 215.
11A_10 contains 197.
11B_10 contains 169.
11C_10 contains 176.
12A_10 contains 174.
12B_10 contains 254.
1A_11 contains 203.
1B_11 contains 203.
2A_11 contains 234.
2B_11 contains 221.
3B_11 contains 180.
4A_11 contains 152.
4B_11 contains 158.
5A_10 contains 168.
5A_11 contains 180.
6A_11 contains 245.
6B_11 contains 206.
7A_10 contains 184.
7A_11 contains 256.
7B_10 contains 244.
8A_10 contains 186.
8A_11 contains 185.
8B_10 contains 155.
9A_10 contains 180.
9A_11 contains 179.
S1B1S_10 contains 168.
S1B1S_11 contains 206.

Total seqs: 5278

Of course all the OTU based analysis worked but again the phylogenetic approach didn’t, as instead of working with a fasta and group files representing 5278 seqs per sample, the obtained files represented 5278 seqs in total.

Could you help me with this one please??

Thank you


So I don’t buy into the argument that rare OTUs should be removed. Garbage can be abundant and good things can be rare. Seems to be that people that do remove rare OTUs are making a lot of unsupported assumptions. But if you insist, why not use filter.shared along with sub.sample.

I know Pat your opinion on removing rare OTUs (and I also think that is better not to do it, or at least when you know that for your analysis purpose from the microbiological and not the statistical point of view the rare are important) but I needed to find a way to compare mine with other people’s analysis. I tried what you say about using filter.shared but then I could not get fasta and group files matching the filtered shared file to build a tree and do unifrac based pcoa on the same dataset than alphadiversity was done (the filtered shared file). That is why I asked if there is a way to have filtered fasta, group, name and taxonomy files matching all together, in which the sequences that cluster alone or in 2-3 members OTUs are not present.