Reads in count table after unique.seqs != actual reads in original file

Dear,

After using Mothur for ten years and updating to v1.47, I noticed that when I run unique.seqs on a fasta file (coming out of a UParse pipeline, without creating any unique sequences prior to the analysis), the number of sequences in the resulting count_table is either far too high or far too low. I am working in a VirtualBox VM running Ubuntu 20.04 LTS.

I counted the reads for one sample in the count_table and compared the result with a grep count on the original fasta file. The sum over all unique reads of the first sample in the count_table is 61,907, while grep returns 30,600 sequences… For another sample the opposite is true: the sum over all unique reads in the count_table is 1,746, while grep returns 27,890… Of course, deeper in the pipeline this results in a large imbalance between the samples: the first sample appears to take almost all reads, while another sample appears to have very few.
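For anyone who wants to repeat this comparison, here is a minimal sketch of the check in Python. It rests on two assumptions that are not confirmed in this thread: the count_table is in the full tab-separated layout (Representative_Sequence, total, then one column per group), and read names follow a hypothetical "<sample>_<read index>" convention. Adapt `sample_from_header` to your own headers.

```python
import csv


def sample_from_header(name):
    # Hypothetical naming convention: "<sample>_<read index>", e.g. "S1_42".
    return name.rsplit("_", 1)[0]


def fasta_counts_per_sample(fasta_lines, sample_of=sample_from_header):
    """Count reads per sample from the header lines of a FASTA file."""
    counts = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].split()
        if not parts:
            continue
        sample = sample_of(parts[0])  # name = first token of the header
        counts[sample] = counts.get(sample, 0) + 1
    return counts


def count_table_totals(table_lines):
    """Sum every group column of a full-format count table (assumed layout)."""
    rows = csv.reader(
        (l for l in table_lines if not l.startswith("#")), delimiter="\t"
    )
    header = next(rows)
    groups = header[2:]  # columns after Representative_Sequence and total
    totals = {g: 0 for g in groups}
    for row in rows:
        if not row:
            continue
        for group, value in zip(groups, row[2:]):
            totals[group] += int(value)
    return totals


def mismatches(fasta_counts, table_totals):
    """Return {sample: (fasta_count, table_count)} where the two disagree."""
    samples = sorted(set(fasta_counts) | set(table_totals))
    return {
        s: (fasta_counts.get(s, 0), table_totals.get(s, 0))
        for s in samples
        if fasta_counts.get(s, 0) != table_totals.get(s, 0)
    }


if __name__ == "__main__":
    fasta = [">S1_1", "ACGT", ">S1_2", "ACGT", ">S2_1", "GGGG"]
    table = [
        "Representative_Sequence\ttotal\tS1\tS2",
        "S1_1\t2\t2\t0",
        "S2_1\t1\t0\t1",
    ]
    print(mismatches(fasta_counts_per_sample(fasta), count_table_totals(table)))
```

With real files, pass open file handles instead of the demo lists. An empty result means the per-sample totals agree.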

I actually use Mothur to remove Chloroplast/Mitochondria/…, and to save computing time I use the unique.seqs command. I am aware that since v1.47 the names parameter has been replaced by count, and I do carry the representative count_tables through to the classify.seqs and remove.lineage commands. However, I noticed that the problem already occurs at the unique.seqs step.

I tested this on three datasets: two large ones (>300 samples) and my personal test data set (6 samples, whose results I discussed above). I have not used the mothur test data.

I have the feeling that this is a bug in v1.47, because if I remove the decontamination step from my analysis, the imbalance does not occur.

Any comments or suggestions on this?
Kind regards,

Sam

Hello, I am sorry that I cannot help. I have rerun 2 different projects with the new version and I get the same results as before. I am using Mothur from start to finish.

Maybe it is the mixing of pipelines (UParse + Mothur) that is causing the problem?

Can you post the commands that you are running? I wonder if you’re using fasta and count files that don’t go together at some point in the workflow. I would also recommend using mothur all the way through rather than mixing and matching with UParse. Is there something you get from UParse that you aren’t able to get from mothur?

Thanks,
Pat

Dear Pat,

Thanks for your reply. I managed to go back to creating name files using the format option, and that seems to solve the problem. After the update I ran unique.seqs(fasta=file.fa), which created a file.unique.fa and a file.count_names. The count file contained only one column, where I had presumed it would contain a column per group. I think my problem is that I don’t specify my groups (as would come from the make.contigs command?). So, to be honest, it is not a bug (?) but my incorrect use of the command without group names?
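One possible way to supply the missing group information is a two-column, tab-separated group file (sequence name, then group). The sketch below derives such lines from fasta headers. It assumes a hypothetical "<sample>_<read index>" naming convention and takes the first whitespace-delimited token of each header as the sequence name. Which mothur command should consume the result depends on the version, so please check the documentation rather than taking this as the recommended route.

```python
def sample_from_header(name):
    # Hypothetical naming convention: "<sample>_<read index>", e.g. "S1_42".
    return name.rsplit("_", 1)[0]


def group_file_lines(fasta_lines, sample_of=sample_from_header):
    """Yield 'sequence_name<TAB>group' lines for every FASTA header."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].split()
        if not parts:
            continue  # skip headers without a name
        name = parts[0]
        yield f"{name}\t{sample_of(name)}"


if __name__ == "__main__":
    demo = [">S1_1 some description", "ACGT", ">S2_7", "GGGG"]
    print("\n".join(group_file_lines(demo)))
```

For a real file, iterate over an open file handle and write the yielded lines to a new text file.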

I like to use part of the UParse pipeline, since I believe in filtering on expected error. The output of this filtering step is sent to mothur for removal of Chloroplast/…, and the resulting output is then sent to the UNoise algorithm to create zOTUs. I am of course open to discussion on whether this is indeed the best way to go.

Further, I teach my students your complete pipeline so that they can avoid the Linux work. So I really appreciate your work and the SOP you published online.
