Hi, I am wondering whether the lengths of sequences have any impact on the results of alignment or of downstream commands such as clustering and rarefaction. The sequences in my primary files (before aligning) are of different lengths but have similar beginnings (a 16S rRNA primer). The shortest is about 800 and the longest is about 1200. If the original sequences have different lengths and are then aligned, used to calculate pairwise distances, clustered, and used to construct a phylogenetic tree, are the results reliable? Should I chop the sequences to the same length before I run the align command?
Also, if I want to identify bacterial sequences at the genus level, what length would be appropriate?
Thank you very much
It won’t affect aligning, but it will affect the distances you calculate and the results from classification. Because of this, I emphatically encourage people to use the filter.seqs(trump=.) command to blunt all sequences so they overlap the same region. Otherwise you’re comparing evolutionary apples to oranges.
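To make the effect of trump=. concrete, here is a minimal Python sketch of the idea, not mothur's actual implementation. It assumes the aligned sequences use "." for positions outside the sequenced region and "-" for internal gaps; the function name trump_filter and the toy sequences are made up for illustration.

```python
def trump_filter(aligned, trump="."):
    """Drop every alignment column in which ANY sequence has the trump character."""
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # A column survives only if no sequence has the trump character there.
    keep = [i for i in range(length) if all(s[i] != trump for s in seqs)]
    return {name: "".join(s[i] for i in keep) for name, s in aligned.items()}


# Toy alignment (hypothetical data): ragged ends are '.', internal gaps are '-'.
aligned = {
    "seqA": "..ACGT-ACGTAC",
    "seqB": "ACACGTTACGT..",
    "seqC": ".CACG--ACGTA.",
}

filtered = trump_filter(aligned)
for name, seq in filtered.items():
    print(name, seq)
```

After filtering, every sequence covers exactly the same columns, so any distance calculated afterwards compares the same region in all of them. Internal gaps ("-") are left alone; only the ragged ends are blunted.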
…with the caveat that the screen.seqs command should ALWAYS be run before filter.seqs (with the potentially dangerous trump option) to remove the shortest sequences. Otherwise you run the risk of trimming all of your sequences to the length of the shortest input sequence. It’s generally desirable to sacrifice a few sequences to ensure the best coverage. In fact, I can think of a simple algorithm mothur could use to automatically suggest to the user what good start and end values might be during screening. It would also be good human interface design if mothur nagged the user when more than half the real bases were being thrown away during filtering.
Robin
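As a rough illustration of why screening first matters, here is a small Python sketch. It is not mothur code and not necessarily the algorithm Robin has in mind; the percentile heuristic in suggest_window, the function names, and the toy alignment are all assumptions made up for this example. It treats start and end as 1-based alignment columns, with a sequence passing the screen if it starts at or before start and ends at or after end.

```python
import math


def span(seq, pad="."):
    """Return (start, end): 1-based columns of the first and last non-pad character."""
    start = len(seq) - len(seq.lstrip(pad)) + 1
    end = len(seq.rstrip(pad))
    return start, end


def overlap(aligned):
    """Columns shared by ALL sequences, i.e. what a trump-style filter would keep."""
    spans = [span(s) for s in aligned.values()]
    return max(s for s, _ in spans), min(e for _, e in spans)


def suggest_window(aligned, keep_fraction=0.9):
    """Hypothetical heuristic: choose start and end so that at least keep_fraction
    of the sequences satisfy each bound (the two bounds are chosen separately)."""
    spans = [span(s) for s in aligned.values()]
    n_keep = max(1, math.ceil(keep_fraction * len(spans)))
    starts = sorted(s for s, _ in spans)
    ends = sorted((e for _, e in spans), reverse=True)
    return starts[n_keep - 1], ends[n_keep - 1]


def screen(aligned, start, end):
    """Keep sequences that start at or before `start` and end at or after `end`."""
    return {n: s for n, s in aligned.items()
            if span(s)[0] <= start and span(s)[1] >= end}


# Toy alignment (hypothetical data) with one very short sequence, s5.
aligned = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": ".CGTACGTACGTACGTACG.",
    "s3": "..GTACGTACGTACGTACGT",
    "s4": "ACGTACGTACGTACGTAC..",
    "s5": "......GTACGT........",
}

print("overlap without screening:", overlap(aligned))
start, end = suggest_window(aligned, keep_fraction=0.75)
screened = screen(aligned, start, end)
print("suggested start/end:", (start, end))
print("overlap after screening:", overlap(screened), "kept", len(screened), "of", len(aligned))
```

In this toy case the single short sequence would cut the shared region down to 6 columns, while dropping it leaves 16 shared columns for the other four. That is the trade-off described above: sacrifice a few sequences to keep much more of the alignment.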
Actually, I have already used filter.seqs, but without the trump option. Is it OK to continue with the downstream calculations? I mean, does running with or without trump make any difference to later steps such as cluster, rarefaction, or building a phylogenetic tree? If I filtered the sequences without trump, are the results reliable? I ask because I have already done a whole series of calculations based on filtering without trump in the command.
I would say no, it’s not OK. What you’re doing is essentially saying that the 16S rRNA gene evolves uniformly across its length. We know this is not true (for more evidence see my recent PLoS Comp. Biol. paper). So you need to run filter.seqs with trump=. and then redo everything that is downstream from there.
Thanks very much. So is it a better strategy to first run the screen.seqs command to remove some short sequences by setting start and end, and then run the filter.seqs command, in order to get better results for cluster, taxonomy, or rarefaction?
Another question: how can I present the libshuff results as a figure, as is done in many papers when comparing clone libraries?
That’s correct - check out the costello example analysis to see how to do this. As for libshuff, all you really need to present are the p-values - if either is significant, the two libraries are different.