duplicates in subsampled taxonomy files

Johannes · March 19, 2014, 5:11pm

(sorry for changing the topic etc, have a slight fever making everything a bit more confusing)

I’ve read somewhere on the mothur forum that the subsampling procedure is without replacement… but when running the sub.sample command on a list, taxonomy, names and groups file I get errors which look like the procedure might have been subsampling the same sequences twice.

This is the error I get when running the classify.otu cmd

mothur > classify.otu(list=Aplysina_fulva_Clone_lib.merged.square.fn.subsample.list, taxonomy=Aplysina_fulva_Clone_lib.merged.silva.wang.subsample.taxonomy, cutoff=80, label=0.030, reftaxonomy=../silva.bacteria/silva.bacteria.silva.tax)
[ERROR]: GU982099 is already in your taxonomy file, names must be unique.
[ERROR]: FM160936 is already in your taxonomy file, names must be unique.

Checking the original files, they only contain one occurrence of either sequence (one for each distance in the case of the list file), so it seems these are subsampled twice…?
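
For anyone wanting to repeat that check without grepping by hand, here is a minimal Python sketch that lists every sequence name occurring more than once in a taxonomy file. It assumes the usual mothur taxonomy layout (sequence name first, then whitespace, then the taxonomy string); the function name `duplicated_ids` is made up for illustration and is not part of mothur.

```python
from collections import Counter

def duplicated_ids(taxonomy_lines):
    """Return sequence names that occur more than once, in first-seen order.

    Assumes each non-empty line starts with the sequence name followed by
    whitespace and the taxonomy string (an assumption about the file layout).
    """
    counts = Counter()
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[line.split()[0]] += 1
    return [name for name, n in counts.items() if n > 1]
```

It can be fed an open file handle, e.g. `duplicated_ids(open("some.subsample.taxonomy"))`, once on the original taxonomy file and once on the subsampled one, to see whether the duplicates only appear after sub.sample.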

Also, when re-running everything, the file that contains the duplicates changes, as does the identity of the duplicated seqs.

When I subsample, I run:

mothur > sub.sample(list=Aplysina_fulva_Clone_lib.merged.square.fn.list, taxonomy=Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, name=Aplysina_fulva_Clone_lib.merged.names, group=Aplysina_fulva_Clone_lib.merged.groups, persample=T, label=0.030)
Sampling 21 from each group.
0.030
Sampling taxonomy and name file... 
/******************************************/
Running command: get.seqs(dups=f, name=Aplysina_fulva_Clone_lib.merged.names, taxonomy=Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, accnos=temp.accnos)
Selected 105 sequences from your name file.
Selected 108 sequences from your taxonomy file.

Output File Names: 
Aplysina_fulva_Clone_lib.merged.pick.names
Aplysina_fulva_Clone_lib.merged.silva.wang.pick.taxonomy

/******************************************/
Done.

Output File Names: 
Aplysina_fulva_Clone_lib.merged.square.fn.subsample.list
Aplysina_fulva_Clone_lib.merged.subsample.groups
Aplysina_fulva_Clone_lib.merged.subsample.names
Aplysina_fulva_Clone_lib.merged.silva.wang.subsample.taxonomy


mothur > quit()

Thanks,

Johannes · March 20, 2014, 12:47pm

I’ve tried re-running everything, but end up with the same problem.

The problem is this. Look at the number of sequences getting subsampled.

mothur > sub.sample(list=./Aplysina_fulva_Clone_lib.merged.square.fn.list, taxonomy=./Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, name=./Aplysina_fulva_Clone_lib.merged.names, group=./Aplysina_fulva_Clone_lib.merged.groups, persample=T, label=0.030)

Sampling 21 from each group.
0.030
Sampling taxonomy and name file... 
/******************************************/
Running command: get.seqs(dups=f, name=./Aplysina_fulva_Clone_lib.merged.names, taxonomy=./Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, accnos=temp.accnos)
Selected 105 sequences from your name file.
Selected 110 sequences from your taxonomy file.

Note that:

Selected 105 sequences from your name file.
Selected 110 sequences from your taxonomy file.

Also note that these numbers change between runs; sometimes it’s 2 seqs, sometimes it’s 5 as shown here (so maybe it has to do with the dups flag?)

This difference translates to 5 duplicated sequences as seen below when running e.g. the classify.otu cmd.

mothur > classify.otu(list=./Aplysina_fulva_Clone_lib.merged.square.fn.subsample.list, taxonomy=./Aplysina_fulva_Clone_lib.merged.silva.wang.subsample.taxonomy, name=./Aplysina_fulva_Clone_lib.merged.subsample.names, group=./Aplysina_fulva_Clone_lib.merged.subsample.groups, cutoff=80, label=0.030, persample=T, reftaxonomy=../silva.bacteria/silva.bacteria.silva.tax)

[ERROR]: FM160913 is already in your taxonomy file, names must be unique.
[ERROR]: GU982075 is already in your taxonomy file, names must be unique.
[ERROR]: GU982078 is already in your taxonomy file, names must be unique.
[ERROR]: FM160937 is already in your taxonomy file, names must be unique.
[ERROR]: FM160941 is already in your taxonomy file, names must be unique.

Am I doing something very wrong here? How can I fix this? 

Thanks,

westcott · March 20, 2014, 1:04pm

How did you merge the files? Looking at the 2 commands you list, the first time mothur is selecting 21 from each group and then 42 from each group. This makes me think perhaps you are merging the same file in twice?

Johannes · March 20, 2014, 2:11pm

Sarah, thanks for replying. Sorry, not the best examples. The first example shows a different sample from the one in the second. For clarity, I’ll update it so it shows the same sample. All original fasta and group files have been used in other commands, so any duplicates would have shown up before.

westcott · March 20, 2014, 3:09pm

Could you send your log file and input files to mothur.bugs@gmail.com so I can track down the issue for you?

Johannes · March 20, 2014, 3:17pm

absolutely! thanks!

westcott · March 20, 2014, 4:32pm

I found the problem. One of the sequences that becomes a duplicate is in your taxonomy file, so mothur is assuming it is a unique. But it is listed in the names file as a duplicate. Any sequence in column 2 is assumed to be a duplicate. FM160913 is such a sequence. I noticed the word merge in the filename. How did you merge the files?
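
For readers who run into the same mismatch, the check described above can be sketched in a few lines of Python. This is only an illustration, not mothur code: it assumes a names file with two whitespace-separated columns (representative name, then a comma-separated list of every name in that group, representative included) and a taxonomy file whose first column is the sequence name. The function names are hypothetical.

```python
def redundant_names(names_lines):
    """Names that appear in column 2 of a names file but are not the
    representative in column 1 (i.e. the sequences treated as duplicates)."""
    redundant = set()
    for line in names_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        representative = fields[0]
        members = fields[1].split(",")
        redundant.update(m for m in members if m and m != representative)
    return redundant

def classified_redundant(names_lines, taxonomy_lines):
    """Sorted list of taxonomy-file names that the names file lists as duplicates."""
    redundant = redundant_names(names_lines)
    found = set()
    for line in taxonomy_lines:
        fields = line.split()
        if fields and fields[0] in redundant:
            found.add(fields[0])
    return sorted(found)
```

If the taxonomy file matches the names file, the result should be an empty list; any name it returns is present in the taxonomy file even though the names file treats it as a duplicate of another sequence.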

Johannes · March 20, 2014, 5:41pm

Thanks for looking into it Sarah. However, the "merged" part of the file names was added by me when concatenating the fasta files and creating the group files (my seqs are not NGS). I used the unique.seqs command to create a names file from my fastas, so is this where the problem is? Why wouldn’t the unique.seqs command create the expected output from my fasta files? Thanks again,

westcott · March 20, 2014, 6:25pm

I am assuming the taxonomy file was created by running classify.seqs on the fasta file. The issue is that, looking in the fasta file, both FM160912 and FM160913 are present. They are also both in the taxonomy file. Perhaps this is a simple typo? Did you include the wrong fasta file on the subsample command? I would have expected a name like Aplysina_fulva_Clone_lib.merged.unique.fasta to match the names file if you ran it with unique.seqs.
