Hi, Mothur community!
I am working with MiSeq data from the gut communities of 6 insect species. My original dataset had 4,105,742 sequences. After removing short, ambiguous, unaligned, non-bacterial and chimeric sequences, I have 2,062,377 sequences in total, and after unique.seqs and pre.cluster with 2 bp differences, 640,944 unique sequences. For every species I had something like 300,000 sequences and 90,000 unique sequences. When clustering at 97%, I found more than 200 OTUs per sample, some even with 1000! (And you know that for an insect gut it should be something like 20-30 OTUs.) Most of the OTUs are singletons or underrepresented OTUs (fewer than 10 seqs), so I guess I have a lot of spurious reads.
I want to go back to my pre-processing steps and try to identify the reads with sequencing errors. I changed the pre.cluster threshold from 2 to 4 differences (which corresponds to an error of about 1% of the sequence length of 427 bp), and now I have 332,577 unique sequences. Also, using split.abund with cutoff=1, I found 314,161 singletons (7.6% of my original dataset), which I still think is a high number.
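For anyone who wants to check the arithmetic behind these figures, here is a minimal Python sketch. It uses only the numbers quoted in this post; the variable names and the `percent` helper are made up for illustration and are not part of mothur.

```python
# Read counts and sequence length as quoted in this post.
raw_reads = 4_105_742        # original dataset
clean_reads = 2_062_377      # after quality filtering and chimera removal
singletons = 314_161         # reported by split.abund with cutoff=1
read_length_bp = 427         # sequence length


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Fraction of the read length covered by each pre.cluster threshold.
for diffs in (2, 4):
    share = percent(diffs, read_length_bp)
    print(f"diffs={diffs}: {share:.2f}% of {read_length_bp} bp")

# Singletons relative to the raw and to the cleaned dataset.
print(f"singletons vs raw reads:   {percent(singletons, raw_reads):.2f}%")
print(f"singletons vs clean reads: {percent(singletons, clean_reads):.2f}%")
```

The singleton share looks quite different depending on whether it is measured against the raw reads or against the reads that survived filtering, so it helps to state which denominator is meant.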
What do you recommend? Should I try a higher pre.cluster threshold, remove the singletons, or is there another alternative?
Thank you,
Sebastián