lane1349.gg.filter - Pat Schloss's transcription of the mask

Dear mothur and beloved children,

I was wondering why there is a third greengenes compatible lane mask to use with filter.seqs? I notice the Lane mask that is currently provided on the GG website is identical to the second one on “http://www.mothur.org/wiki/Lane_mask” (1287), but I wonder what the third one is.

Is the third one better according to you (to use before phylogenetic reconstruction) or does it depend on what you want to do with it?

For your information, I have added a small number of 454 reads to the total Greengenes reference alignment using align.seqs, and I want to calculate a phylogenetic tree of the resulting merged alignment using FastTree. I therefore want to mask the variable regions of the alignment that could interfere with correct phylogenetic reconstruction.

Thank you for any information you would be able to provide.

Sam

I think the 3rd one may have been lifted from the greengenes ARB database, but I’m not 100% sure. Regardless, unless one is doing phylogenetics, I would strongly encourage people to stay away from using these types of masks, as they mute the genetic diversity between sequences and make things look more similar than they really are. This is appropriate for a broad-scale phylogeny, but not for fine-scale OTU-based analyses.

Thank you for your answer.

Broad-scale phylogeny is what I’m trying to do here. I’m just curious about what the differences are between the three greengenes-compatible Lane masks you provide:

lane1241.gg.filter - A Lane mask that comes with the greengenes arb database
lane1287.gg.filter - A Lane mask that comes with the greengenes arb database
lane1349.gg.filter - Pat Schloss’s transcription of the mask from the Lane paper

When or why should one use, for example, Lane1287 instead of Lane1349? I have a hard time choosing one of these three, because I am a little bit in the dark about what the choice should be based on.

Kind Regards,

Sam

I believe that the 4-digit number is the number of columns that will come out of the filtering of full-length sequences. A better approach might be to use the soft filter option to remove any columns where the most common base in a position occurs in less than 50% of the sequences.
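For anyone reading along, here is a rough Python sketch of the two ideas in the reply above. It is not mothur's code. It assumes a hard mask is a string of 0s and 1s with one character per alignment column (so the number of 1s is the number of columns kept), and it assumes the soft filter computes the share of the most common base over all sequences, with gap characters ("-" and ".") not counted as bases. The function names and the toy alignment are made up for illustration.

```python
from collections import Counter

def soft_mask(seqs, threshold=0.5):
    """Build a 0/1 mask from the data: keep a column only if its most
    common base occurs in at least `threshold` of the sequences.
    Assumption: gaps ('-' and '.') are not counted as bases."""
    n = len(seqs)
    mask = []
    for col in range(len(seqs[0])):
        counts = Counter(s[col].upper() for s in seqs if s[col] not in "-.")
        top = counts.most_common(1)[0][1] if counts else 0
        mask.append("1" if top / n >= threshold else "0")
    return "".join(mask)

def apply_mask(seq, mask):
    """Keep only the columns where the mask has a '1'."""
    return "".join(base for base, keep in zip(seq, mask) if keep == "1")

# Toy alignment (hypothetical data): columns 3 and 5 have no majority base.
aligned = ["ACGTA", "ACTTC", "AGATG", "ATCTT"]

mask = soft_mask(aligned, threshold=0.5)
print(mask)                          # 11010
print(mask.count("1"))               # 3 columns survive the filter
print(apply_mask(aligned[0], mask))  # ACT
```

A fixed Lane mask would be passed straight to apply_mask as a precomputed string, whereas the soft mask is recomputed from whatever alignment you actually have.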

Thanks for the advice! I’m not sure I understand why a Lane mask is not advisable in my case, though:

Is the Lane mask not advisable because the alignment also contains short 454 sequences? The Greengenes consortium does use a Lane mask before constructing their large phylogenetic trees, and the only difference between their alignment and mine is the extra 454 sequences I added. So I’m guessing that is why you are suggesting a soft mask?

Or did I misunderstand, and is there another reason not to use the Lane mask for broad-scale phylogeny here?

Kind Regards,

Sam

Well the Lane mask was developed in 1991ish to make a phylogeny between the three domains. If you can find the original paper (good luck!) it’s based on an alignment of about 10 sequences. My understanding is that the soft mask is to be preferred because it will do a better job of “fitting” the actual data you have.

=> OK, this in particular explains a lot; I didn’t know that. Thank you for the clarification.