After using Mothur for ten years and updating to v1.47, I noticed that when I run unique.seqs on a fasta file (coming out of a UPARSE pipeline, without any dereplication prior to the analysis), the number of sequences in the resulting count_table is either far too high or far too low. I am working in a VirtualBox VM running Ubuntu 20.04 LTS.
I counted them for one sample in the count_table and compared the result with a grep count on the original fasta file. The sum in the count_table of all reads associated with the unique reads of the first sample is 61,907, while grep returns 30,600 sequences. For another sample the opposite is true: the total of all unique reads in the count_table is 1,746, while grep returns 27,890. Of course, deeper in the pipeline this results in a large imbalance between the samples: the first sample appears to take almost all reads, while the other sample has very few.
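For anyone who wants to reproduce this comparison, here is a minimal Python sketch of the two counts. It rests on assumptions that are not confirmed above: that the sample label appears as a substring of each fasta header (as it would for a plain grep), and that the count_table is a tab-separated file whose first line is a header with one column per sample. The function names and the sample labels are hypothetical.

```python
import csv


def count_fasta_headers(fasta_lines, sample_label):
    """Count fasta header lines containing sample_label.

    Mirrors a grep count on '>' lines. Like grep, this is a substring
    match, so 'sample1' would also match 'sample10'.
    """
    return sum(
        1
        for line in fasta_lines
        if line.startswith(">") and sample_label in line
    )


def sum_count_table_column(table_lines, sample_label):
    """Sum one sample's column in a tab-separated count_table.

    Assumes the first line is the header and that sample_label is one of
    its column names; a KeyError is raised otherwise, which signals that
    the file does not have the assumed layout.
    """
    reader = csv.DictReader(table_lines, delimiter="\t")
    return sum(int(row[sample_label]) for row in reader)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real files (hypothetical data).
    fasta = [">sampleA_1\n", "ACGT\n", ">sampleA_2\n", "ACGT\n",
             ">sampleB_1\n", "ACGG\n"]
    table = ["Representative_Sequence\ttotal\tsampleA\tsampleB\n",
             "sampleA_1\t2\t2\t0\n",
             "sampleB_1\t1\t0\t1\n"]
    for label in ("sampleA", "sampleB"):
        print(label,
              count_fasta_headers(fasta, label),
              sum_count_table_column(table, label))
```

With real files, the same functions can be given open file handles instead of lists. If the two numbers differ for a sample, the discrepancy is already present right after unique.seqs.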
I actually use Mothur to remove Chloroplast/Mitochondria/…, and to save computing time I use the unique.seqs command. I am aware that since v1.47 the names parameter has been replaced by count, and I do carry the representative count_tables along into the classify.seqs and remove.lineage commands. However, the problem already appears at the unique.seqs step.
I tested this on three datasets: two large ones (>300 samples each) and my personal test dataset (6 samples, whose results I described above). I have not used the mothur test data.
I suspect this is a bug in v1.47, because when I remove the decontamination step from my analysis, the imbalance does not occur.
Any comments or suggestions on this?