sub.sample() taxonomy error message "read missing"

Hi all,

I am using the sub.sample() command to subsample a set of MiSeq sequences. The full data set is too big to process for OTU analysis: I have 1043161 unique reads after denoising and chimera checking, and I only want around 500-1500 unique reads.

My command looks like this:

sub.sample(fasta=all.otu.unique.fasta, name=all.otu.names, group=all.otu.groups, taxonomy=all.otu.rdp.taxonomy.80, size=500)

Here the fasta file contains all unique reads, the names file lists the reads that share the same sequence, the group file records which sample each read belongs to, and the taxonomy file contains the taxonomy assignment for every read (not just the unique reads).

After running this command, I got the following output and error messages:

Sampling 500 from 13879970.
Deconvoluting subsampled fasta file...
/******************************************/
Running command: unique.seqs(fasta=all.otu.unique.subsample.fasta)
500     331

Output File Names:
all.otu.unique.subsample.names
all.otu.unique.subsample.unique.fasta

/******************************************/
Done.
[ERROR]: M01246_29_000000000-A461D_1_1101_13732_3258 is missing, please correct.
[ERROR]: M01246_29_000000000-A461D_1_1101_14913_3491 is missing, please correct.
[ERROR]: M01246_29_000000000-A461D_1_1101_20287_3764 is missing, please correct.

This is not the complete list of error messages.

I wasn’t sure what it meant by a read being missing, so I checked the very first read.

It is in the original names file, and also in the group file and the taxonomy file. It is not in the fasta file, since it is a duplicate read. So how is it missing, if the read name is in the input files I provided? Am I doing something wrong? Should I use the non-unique fasta in this command?

However, I did get output files that seem OK:

Output File Names:
all.otu.subsample.names
all.otu.rdp.taxonomy.subsample.80
all.otu.unique.subsample.fasta
all.otu.subsample.groups

But this error message bothers me. I am afraid it will have some effect on the subsampled reads. Has anyone else seen the same error message? What is the effect if I just use these output files and ignore the error? Any suggestions are welcome.

PS: I tried different versions of mothur, and they all give the same error message.

Thank you,

Eddi

I suspect it’s an issue with the names file format. Could you post a line from the names file that contains one of the “missing” names?

Thanks for your reply. I have taken M01246_29_000000000-A461D_1_1101_13732_3258 as the missing name. It has the same sequence as M01246_29_000000000-A461D_1_1101_6986_10337. The line below is truncated, since the complete list is very long.

M01246_29_000000000-A461D_1_1101_6986_10337     M01246_29_000000000-A461D_1_1101_6986_10337,M01246_29_000000000-A461D_1_1102_21228_3750,M01246_29_000000000-A461D_1_1103_7955_10631,M01246_29_000000000-A461D_1_1104_16068_11765,M01246_29_000000000-A461D_1_1104_17118_13212,M01246_29_000000000-A461D_1_1104_5934_16054,M01246_29_000000000-A461D_1_1104_21029_23119,M01246_29_000000000-A461D_1_1105_22950_6934,M01246_29_000000000-A461D_1_1105_24399_11228,M01246_29_000000000-A461D_1_1106_20861_19961,M01246_29_000000000-A461D_1_1107_21427_5556,M01246_29_000000000-A461D_1_1107_14126_11284,M01246_29_000000000-A461D_1_1107_14178_12538,M01246_29_000000000-A461D_1_1107_22649_12745,M01246_29_000000000-A461D_1_1107_7644_17247,M01246_29_000000000-A461D_1_1107_23830_18850,M01246_29_000000000-A461D_1_1108_10610_21847,M01246_29_000000000-A461D_1_1110_10744_15878,M01246_29_000000000-A461D_1_1112_8166_10114,M01246_29_000000000-A461D_1_1113_11396_26386,M01246_29_000000000-A461D_1_1114_13721_20352,M01246_29_000000000-A461D_1_2101_19564_5987,M01246_29_000000000-A461D_1_2101_25070_12450,M01246_29_000000000-A461D_1_2101_6907_12477,M01246_29_000000000-A461D_1_2101_23602_14438,M01246_29_000000000-A461D_1_2101_6541_16101,M01246_29_000000000-A461D_1_2101_15154_23084,M01246_29_000000000-A461D_1_2102_7457_5266,M01246_29_000000000-A461D_1_2102_17362_20797,M01246_29_000000000-A461D_1_2103_25747_19114,M01246_29_000000000-A461D_1_2103_7672_20752,M01246_29_000000000-A461D_1_2103_13431_28150,M01246_29_000000000-A461D_1_2104_21647_9101,M01246_29_000000000-A461D_1_2104_10423_15152,...

The read M01246_29_000000000-A461D_1_1101_13732_3258 appears about a quarter (25%) of the way through the name list.

The names file was generated with the unique.seqs() command. Let me know if there is any problem with the format.

Eddi
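
For anyone wanting to check this kind of thing without scrolling through a very long line: a names file line has two whitespace-separated columns, the representative read followed by a comma-separated list of every read with that sequence. Below is a minimal Python sketch (not a mothur command) that reports which representative a given read is listed under. The helper name find_representative and the toy read names are made up for illustration.

```python
def find_representative(names_lines, read_id):
    """Return the representative read whose duplicate list contains read_id.

    names_lines: any iterable of names-file lines (an open file works too).
    Returns None if read_id does not appear in any duplicate list.
    """
    for line in names_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        representative, members = fields
        if read_id in members.split(","):
            return representative
    return None


# Toy data standing in for a real names file (hypothetical read names).
example = [
    "repA\trepA,dup1,dup2",
    "repB\trepB",
]
print(find_representative(example, "dup2"))
```

On a real file this could be called as find_representative(open("all.otu.names"), "M01246_29_000000000-A461D_1_1101_13732_3258"). It only confirms that the names file itself is consistent; it says nothing about the other input files.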

The format looks fine. Could you send your log file, fasta, name, group and taxonomy files to mothur.bugs@gmail.com?

Hi, thanks for looking into this issue. However, the files are very big, about 5 GB in total; please see the listing below. I don’t know whether Gmail can handle this. If it cannot, is there another way to transfer the files?

-rw-r--r-- 1 hl0333 pi_qd0005 651M Mar 10 16:22 all.otu.unique.fasta
-rw-r--r-- 1 hl0333 pi_qd0005 2.1G Mar 10 16:21 all.otu.rdp.taxonomy.80
-rw-r--r-- 1 hl0333 pi_qd0005 631M Mar 10 16:21 all.otu.names
-rw-r--r-- 1 hl0333 pi_qd0005 700M Mar 10 16:21 all.otu.groups

And the logfile is another 1 GB because a lot of missing-read errors are listed in it.
Eddi

Hi Eddi,
The error is coming from the taxonomy file. It does not seem to match the other files. It only contains 199 sequences.
Kindly,
Sarah Westcott
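
A mismatch like this can be spotted before running sub.sample() by comparing the read names in the names file with the first column of the taxonomy file. What follows is a rough Python sketch, not part of mothur, and the function names are hypothetical. It assumes the usual layouts: names file lines are "representative, then a comma-separated list of reads", and taxonomy file lines start with the read name followed by whitespace and the taxonomy string. It also holds every read name in memory, which may be a lot for files of this size.

```python
def read_ids_in_names(names_lines):
    """Collect every read name listed in the second column of a names file."""
    ids = set()
    for line in names_lines:
        fields = line.split()
        if len(fields) == 2:
            ids.update(fields[1].split(","))
    return ids


def read_ids_in_first_column(lines):
    """Collect the first whitespace-separated field of each non-blank line."""
    ids = set()
    for line in lines:
        fields = line.split(None, 1)
        if fields:
            ids.add(fields[0])
    return ids


def missing_from_taxonomy(names_lines, taxonomy_lines):
    """Return reads that are in the names file but absent from the taxonomy file."""
    return read_ids_in_names(names_lines) - read_ids_in_first_column(taxonomy_lines)


# Toy data (hypothetical read names and taxonomy strings).
names = ["repA\trepA,dup1,dup2", "repB\trepB"]
taxonomy = ["repA\tBacteria(100);Firmicutes(95);", "dup1\tBacteria(100);Firmicutes(95);"]

missing = missing_from_taxonomy(names, taxonomy)
print(len(missing), sorted(missing))
```

If the count printed for the real files is large, the taxonomy file does not cover the reads in the names file, which would be consistent with the "is missing" errors above. The same first-column helper could be pointed at the group file for a similar check.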