duplicates in subsampled taxonomy files

(sorry for changing the topic etc, have a slight fever making everything a bit more confusing)

I’ve read somewhere on the mothur forum that the subsampling procedure is without replacement, but when I run the sub.sample command on a list, taxonomy, names and groups file I get errors which suggest that the procedure has subsampled the same sequences twice.

This is the error I get when running the classify.otu cmd

mothur > classify.otu(list=Aplysina_fulva_Clone_lib.merged.square.fn.subsample.list, taxonomy=Aplysina_fulva_Clone_lib.merged.silva.wang.subsample.taxonomy, cutoff=80, label=0.030, reftaxonomy=../silva.bacteria/silva.bacteria.silva.tax)
[ERROR]: GU982099 is already in your taxonomy file, names must be unique.
[ERROR]: FM160936 is already in your taxonomy file, names must be unique.

Checking the original files, they only contain one occurrence of either sequence (one for each distance in the case of the list file), so it seems these are subsampled twice…?
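For anyone wanting to repeat this check without eyeballing the files, here is a minimal Python sketch (not part of mothur) that lists names occurring more than once in a taxonomy file. It assumes the usual layout of one sequence name followed by whitespace and the taxonomy string per line; the function name and the taxonomy strings in the sample are made up for illustration.

```python
from collections import Counter


def duplicated_names(taxonomy_lines):
    """Return names that occur more than once, in first-seen order."""
    counts = Counter()
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the sequence name is the first whitespace-delimited field.
        name = line.split(None, 1)[0]
        counts[name] += 1
    return [name for name, n in counts.items() if n > 1]


# Illustrative input only; the taxonomy strings are placeholders.
sample = [
    "GU982099\tBacteria;",
    "FM160936\tBacteria;",
    "GU982099\tBacteria;",
]
print(duplicated_names(sample))

# With a real file (path taken from the commands in this thread):
# with open("Aplysina_fulva_Clone_lib.merged.silva.wang.subsample.taxonomy") as fh:
#     print(duplicated_names(fh))
```

Running it on both the original and the subsampled taxonomy file should show whether the duplicates only appear after sub.sample.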

Also, when I re-run everything, the file that contains the duplicates changes, as does the identity of the duplicated seqs.

When I subsample, I run:

mothur > sub.sample(list=Aplysina_fulva_Clone_lib.merged.square.fn.list, taxonomy=Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, name=Aplysina_fulva_Clone_lib.merged.names, group=Aplysina_fulva_Clone_lib.merged.groups, persample=T, label=0.030)
Sampling 21 from each group.
0.030
Sampling taxonomy and name file... 
/******************************************/
Running command: get.seqs(dups=f, name=Aplysina_fulva_Clone_lib.merged.names, taxonomy=Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, accnos=temp.accnos)
Selected 105 sequences from your name file.
Selected 108 sequences from your taxonomy file.

Output File Names: 
Aplysina_fulva_Clone_lib.merged.pick.names
Aplysina_fulva_Clone_lib.merged.silva.wang.pick.taxonomy

/******************************************/
Done.

Output File Names: 
Aplysina_fulva_Clone_lib.merged.square.fn.subsample.list
Aplysina_fulva_Clone_lib.merged.subsample.groups
Aplysina_fulva_Clone_lib.merged.subsample.names
Aplysina_fulva_Clone_lib.merged.silva.wang.subsample.taxonomy


mothur > quit()

Thanks,

I’ve tried re-running everything, but end up with the same problem.

The problem is this. Look at the number of sequences getting subsampled.

mothur > sub.sample(list=./Aplysina_fulva_Clone_lib.merged.square.fn.list, taxonomy=./Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, name=./Aplysina_fulva_Clone_lib.merged.names, group=./Aplysina_fulva_Clone_lib.merged.groups, persample=T, label=0.030)

Sampling 21 from each group.
0.030
Sampling taxonomy and name file... 
/******************************************/
Running command: get.seqs(dups=f, name=./Aplysina_fulva_Clone_lib.merged.names, taxonomy=./Aplysina_fulva_Clone_lib.merged.silva.wang.taxonomy, accnos=temp.accnos)
Selected 105 sequences from your name file.
Selected 110 sequences from your taxonomy file.

Note that:

Selected 105 sequences from your name file.
Selected 110 sequences from your taxonomy file.

Also note that these numbers change between runs: sometimes the difference is 2 seqs, sometimes it’s 5 as shown here (so maybe it has to do with the dups flag?)

This difference translates to 5 duplicated sequences as seen below when running e.g. the classify.otu cmd.

mothur > classify.otu(list=./Aplysina_fulva_Clone_lib.merged.square.fn.subsample.list, taxonomy=./Aplysina_fulva_Clone_lib.merged.silva.wang.subsample.taxonomy, name=./Aplysina_fulva_Clone_lib.merged.subsample.names, group=./Aplysina_fulva_Clone_lib.merged.subsample.groups, cutoff=80, label=0.030, persample=T, reftaxonomy=../silva.bacteria/silva.bacteria.silva.tax)

[ERROR]: FM160913 is already in your taxonomy file, names must be unique.
[ERROR]: GU982075 is already in your taxonomy file, names must be unique.
[ERROR]: GU982078 is already in your taxonomy file, names must be unique.
[ERROR]: FM160937 is already in your taxonomy file, names must be unique.
[ERROR]: FM160941 is already in your taxonomy file, names must be unique.

Am I doing something very wrong here? How can I fix this? 

Thanks,

How did you merge the files? Looking at the two commands you listed, the first time mothur is selecting 21 from each group and then 42 from each group. This makes me think perhaps you are merging the same file in twice?

Sarah, thanks for replying. Sorry, not the best examples. The first example shows a different sample from the one in the second. For clarity, I’ll update it so both show the same sample. All the original fasta and group files have been used in other commands, so any duplicates in them would have shown up before.

Could you send your log file and input files to mothur.bugs@gmail.com so I can track down the issue for you?

absolutely! thanks!

I found the problem. One of the sequences that becomes a duplicate is in your taxonomy file, so mothur is assuming it is a unique. But it is listed in the names file as a duplicate. Any sequence in column 2 is assumed to be a duplicate. FM160913 is such a sequence. I noticed the word merge in the filename. How did you merge the files?
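A quick way to look for this kind of mismatch is sketched below in Python (this is not a mothur command). It assumes the standard two-column names file, with the representative name in column 1 and a comma-separated list of all the names it represents in column 2, and a taxonomy file with the sequence name as the first field. The function name is hypothetical, and the pairing of FM160912 with FM160913 in the sample is only an illustration.

```python
def redundant_in_taxonomy(names_lines, taxonomy_lines):
    """Names that are only duplicates in the names file (present in
    column 2 but not the representative) yet also have their own
    line in the taxonomy file."""
    non_representatives = set()
    for line in names_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        representative, members = fields[0], fields[1].split(",")
        non_representatives.update(m for m in members if m != representative)

    # Assumption: the sequence name is the first field of each taxonomy line.
    taxonomy_names = {line.split()[0] for line in taxonomy_lines if line.strip()}
    return sorted(taxonomy_names & non_representatives)


# Illustrative input only; taxonomy strings are placeholders.
names = [
    "FM160912\tFM160912,FM160913",
    "GU982075\tGU982075",
]
taxonomy = [
    "FM160912\tBacteria;",
    "FM160913\tBacteria;",
    "GU982075\tBacteria;",
]
print(redundant_in_taxonomy(names, taxonomy))
```

Any name it reports has a taxonomy line of its own even though the names file says it is represented by another sequence, which is the situation described above.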

Thanks for looking into it Sarah. However, the "merged" in the file names was added by me when concatenating the fasta files and creating the group files (my seqs are not NGS). I used the unique.seqs command to create a names file from my fasta files, so is this where the problem is? Why wouldn’t the unique.seqs command create the expected output from my fasta files? Thanks again,

I am assuming the taxonomy file was created by running classify.seqs on the fasta file. The issue is that, looking in the fasta file, both FM160912 and FM160913 are present. They are also in the taxonomy file. Perhaps this is a simple typo? Did you include the wrong fasta file on the subsample command? I would have expected a name like Aplysina_fulva_Clone_lib.merged.unique.fasta to match the names file if you ran it with unique.seqs.