I think I’m getting bad results from the pre.cluster command with method=unoise in both mothur 1.43.0 and 1.44.3. For example, processing a 16S V4 data set of 20 samples through the Schloss MiSeq SOP up to the pre.cluster command and then running:
pre.cluster(count=my.count_table, fasta=my.fasta, method=unoise, processors=N)
will produce map files where the diffs column values typically have the following strange distribution:
$ cut -f4 *.map | sort -n | uniq -c
29564 0
20 diffs
47476 1
17807 2
5398 3
2696 4
903 5
2 6
520 8
where the number of unique sequences belonging to a cluster drops off as expected going from 1 diff to 6, but then there is a spike of (here 520) unique sequences with 8 differences to their main sequence. (The "20 diffs" line is just the header row of each of the 20 map files.) Within a particular run, these diffs=8 cases typically occur in some map files but not others, and which map files are affected changes between re-runs of exactly the same command. There may be a loose relationship between the processors=N option and the number of affected map files: with N=20 or higher I sometimes get a run with no diffs=8 cases at all, with N=19 there is typically only one map file with diffs=8 cases, and more seem to appear as N goes lower. In any case, across many repeated runs and some variation in the number of processors, the map file for a given sample only ever has one of exactly two outcomes (one with diffs=8 mappings and one without).
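In case it helps anyone reproduce this, here is a rough sketch of how the affected map files can be listed one by one rather than pooled. It assumes the map files are tab-separated with a single header line and that diffs is column 4 (as in the cut -f4 call above); the function name max_diffs_per_file is just something I made up for illustration.

```shell
# Print "<file><TAB><largest diffs value>" for each map file given.
# Assumes tab-separated input, one header line, diffs in column 4.
max_diffs_per_file() {
    for f in "$@"; do
        awk -F'\t' -v name="$f" '
            NR > 1 && $4 + 0 > max { max = $4 + 0 }   # skip header, track maximum
            END { printf "%s\t%d\n", name, max }
        ' "$f"
    done
}

# Example: max_diffs_per_file *.map | sort -t "$(printf '\t')" -k2,2n
```

Sorting the output on the second column puts the files with the unexpected high diffs values at the bottom, which makes it easy to compare which samples are affected between re-runs.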
The above is from a subset of a much larger dataset, for which I ran the SOP twice. For one of those runs, the diffs distribution across all the map files produced by pre.cluster is as follows (the other run looks similar enough):
$ cut -f4 *[0-9].map | sort -n | uniq -c
8338400 0
7021 diffs
13934448 1
4819969 2
1751962 3
597800 4
128355 5
4079 6
12 7
263628 12
Here I got sequences with 12 differences to the center of their respective cluster but none with 8, 9, 10, or 11. In this case about 5% of samples are affected; the rest have a max diffs of 7 or below.
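Since the outlier value differs between datasets (8 in the subset, 12 here), a fixed cutoff is awkward. A possible threshold-free check, sketched below, is to flag any map file whose diffs values skip a number between 0 and that file's maximum. This rests on my assumption that a gap in the distribution marks an affected file, which may not hold in general; the same layout assumptions apply (tab-separated, one header line, diffs in column 4), and files_with_diffs_gap is a made-up name.

```shell
# Print map files whose diffs column skips at least one value between 0 and
# the file's maximum (e.g. 0-6 present, 7 absent, 8 present).
# Assumes tab-separated input, one header line, diffs in column 4.
files_with_diffs_gap() {
    for f in "$@"; do
        awk -F'\t' -v name="$f" '
            NR > 1 { d = $4 + 0; seen[d] = 1; if (d > max) max = d }
            END {
                if (NR < 2) exit                      # header only or empty file
                for (i = 0; i <= max; i++)
                    if (!(i in seen)) {               # report the first missing value
                        printf "%s\tmissing=%d\tmax=%d\n", name, i, max
                        exit
                    }
            }
        ' "$f"
    done
}

# Example: files_with_diffs_gap *[0-9].map | wc -l
```

Dividing the resulting count by the total number of map files gives the fraction of flagged samples, which can be compared against the roughly 5% mentioned above.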
Some background: I’m trying to recover ASVs as well as 97% OTUs and chose the unoise pre.clustering method because it can resolve the single-base variation present in the 16S sequences of one of the organisms in the Zymo mock community. I’d be grateful for any advice on how to proceed. Currently I exclude the 5% of affected samples from downstream analysis. For 97% OTUs I should probably repeat pre.cluster with the usual diffs=2 method.
Let me know if you’d like me to provide more information.
Thanks,
Robert