In several studies using 454 pyrosequencing, one can read that reads below e.g. 200 bp were filtered out, leaving only reads above 200 bp for further processing. This means that the remaining reads differ in length. How can one assign OTUs to sequences that differ in length? E.g. the only difference between two reads may be their length, not a real difference in sequence. Is there a way to overcome this problem without reducing the length of all sequences to 200 bp?
Yup. All of the sequences need to be trimmed so they overlap over the same region. This is because the 16S gene does not evolve evenly over its length. So if some sequences are longer than others, the extra bases can cover more or less variable sites and skew the output. I generally try to go for a length where I am able to keep 95% of the sequences. By the way, this advice holds for both phylotyping and OTUing.
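To make the "keep 95% of the sequences" idea concrete, here is a minimal Python sketch. It is not what mothur does internally; it assumes unaligned reads that all start at the same position (e.g. right after the same primer), so that cutting every read to one length leaves them covering the same region. With aligned sequences you would screen on alignment start/end positions instead. The function names and the `keep` parameter are made up for illustration.

```python
import math


def length_keeping_fraction(lengths, keep=0.95):
    """Return the largest length L such that at least `keep` of the reads are >= L."""
    if not lengths:
        raise ValueError("no reads given")
    ordered = sorted(lengths, reverse=True)
    # Number of reads that must survive the cutoff (rounded up).
    n_keep = math.ceil(keep * len(ordered))
    # The n_keep-th longest read sets the common length.
    return ordered[n_keep - 1]


def trim_to_common_length(reads, keep=0.95):
    """Drop reads shorter than the cutoff and truncate the rest to it.

    `reads` is a dict mapping read name -> sequence string.
    Assumption: every read starts at the same position in the gene.
    """
    cutoff = length_keeping_fraction([len(seq) for seq in reads.values()], keep)
    trimmed = {
        name: seq[:cutoff]
        for name, seq in reads.items()
        if len(seq) >= cutoff
    }
    return cutoff, trimmed
```

After this step, two reads that differed only in length become identical strings, so they can no longer end up in different OTUs just because one of them was sequenced further.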
Thanks for the response. So as I see it, many studies overestimate the number of OTUs because they compare reads of different lengths, right? Regarding your analysis of the Sogin and Costello data, you do not take this problem into account for the Sogin data, while you have dealt with the problem in the Costello analysis using the screen.seqs command, correct?
I suspect that at the 5’ end and in the V6 you could inflate the number of OTUs, and at the 3’ end you could deflate them, but I haven’t done the experiment yet. You are right about how the example pages were done; I/someone should probably go back, trim the Sogin dataset, and redo the analysis.
Thanks for your questions!