filtering or screening or getting sequences with a mask

Hi Pat and mothur aficionados,

I have been trying to figure this out for a little while but I keep hitting a wall. Is there an easy way to screen, filter, or get aligned sequences that fit a user-defined mask or a user-defined consensus sequence? I can think of several ways to make the mask or consensus sequence (e.g. generate a consensus sequence in mothur or generate a mask in ARB), but I can’t figure out how to pull out the sequences that fit this mask. I want to screen a pyrosequencing library with 69,000 sequences.

Any ideas how to figure this out would be greatly appreciated,

In filter.seqs there is an option to use a “hard” mask, which is a user-supplied mask consisting of 0’s and 1’s that indicate which columns to chuck and which to keep. We provide versions of the Lane mask for the greengenes and SILVA alignments, but you could “easily” make your own. Is this what you mean?
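
To make the idea of a 0/1 mask concrete, here is a minimal Python sketch of what applying one to aligned sequences amounts to. It is only an illustration, not mothur's implementation; the function name `apply_hard_mask` and the toy sequences are made up for the example.

```python
def apply_hard_mask(aligned_seq: str, mask: str) -> str:
    """Keep only the alignment columns where the mask has a '1'."""
    if len(aligned_seq) != len(mask):
        raise ValueError("sequence and mask must have the same number of columns")
    return "".join(base for base, keep in zip(aligned_seq, mask) if keep == "1")


# Toy data, invented for illustration: one mask character per alignment column.
mask = "0011110011"
alignment = {
    "seqA": "AC-GTTAGCA",
    "seqB": "ACGGT-AGCT",
}

# Every sequence is cut down to the same set of columns.
filtered = {name: apply_hard_mask(seq, mask) for name, seq in alignment.items()}
print(filtered)
```

Note that a mask like this only decides which columns survive; it does not by itself pick out which sequences "match" anything.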

Hi Pat,
ahh, okay, thank you, I finally understand what a mask is. I was thinking of it a little differently, maybe more like a consensus sequence.

My idea was

  1. I have a FISH probe that hits a particular population of cells
  2. I also have a clone library from that environment,
  3. From the clone library I select the sequences that match my probe
  4. I then trim the sequences to just include a variable region
  5. obtain a consensus sequence from the variable region of the trimmed clones
  6. get the sequences in my pyrosequencing library that match that consensus sequence
  7. follow the rest of the Costello stool sample pipeline to look at how those sequences differ throughout my pyrosequencing dataset (various sample sites, sampling dates etc).

Steps 1-5 seem pretty straightforward, but I am stuck on number 6. I think there must be a way to do it in mothur but I just can’t figure out the best command.
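
For reference, step 5 can be sketched in a few lines of Python as a simple majority-rule consensus over the alignment columns. This is an assumption about what "consensus" means here (most common character per column, gaps counted like any other character); the function name and the toy clones are hypothetical, and a tool such as mothur or ARB may apply different rules.

```python
from collections import Counter


def majority_consensus(aligned_seqs):
    """Return the most common character in each alignment column."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        # most_common(1) gives [(character, count)] for the top character
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Toy trimmed clones, invented for illustration.
clones = ["ACGTA", "ACGTT", "ACCTA"]
consensus = majority_consensus(clones)
print(consensus)
```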

Thank you,

Emily -
Hmmm. For #6 - could you align your probe sequence? If so, you could use classify.seqs(method=knn, numwanted=1, search=distance) and give it your probe and sequences. You would then get an output which would be the distance between your sequences and your probe. Those that are dead on or within a given threshold would be what you’d want (I think). This is something we’ve been thinking about for removing “contaminants”, but at this point don’t have a better approach yet. Let me know what you think.
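
The thresholding idea can be illustrated with a small Python sketch. It is not what classify.seqs does internally: it assumes the reads and the reference are already aligned to the same length, uses a plain mismatch fraction that skips any column where either sequence has a gap, and all names and sequences are invented for the example.

```python
def mismatch_fraction(seq: str, ref: str) -> float:
    """Fraction of compared columns that differ (columns with a gap are skipped)."""
    if len(seq) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq.upper(), ref.upper()):
        if a in "-." or b in "-.":
            continue  # assumption: gap columns carry no information
        compared += 1
        if a != b:
            mismatches += 1
    # With nothing to compare, treat the pair as maximally distant.
    return mismatches / compared if compared else 1.0


def select_close(seqs: dict, ref: str, threshold: float) -> list:
    """Names of sequences whose distance to the reference is within the threshold."""
    return [name for name, seq in seqs.items()
            if mismatch_fraction(seq, ref) <= threshold]


# Toy aligned data, invented for illustration.
reference = "--ACGTACGT--"
reads = {
    "read1": "--ACGTACGT--",  # identical to the reference
    "read2": "--ACGTACCT--",  # one mismatch in eight columns
    "read3": "--TTGTACCA--",  # four mismatches in eight columns
}
hits = select_close(reads, reference, threshold=0.15)
print(hits)
```

The threshold of 0.15 is arbitrary; a sensible cutoff depends on the length of the region and the expected sequencing error.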


Hi Pat,

Thanks. Hmm, the probe sequence isn’t in the variable region, so maybe the mask is the route to go then. So I take the clones, align them, import them into an alignment editor (in this case ARB), trim them down to the variable region targeted in my pyrosequencing library, generate a consensus sequence for this region, and export it.

Then I add the trimmed clone sequences to the pyrosequencing dataset (as taxa controls), make the mask and mask over all of the variable base pairs, group the sequences in the pyrosequencing dataset, obtain the names for each grouping, find my known clones, and then run get.seqs using the accnos file.
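
As a rough sketch of that last part, the Python below groups masked sequences, finds the groups that contain a known clone, and lists the other members one name per line (the layout I understand an accnos file to have). It assumes "grouping" means exact identity after masking, which is a simplification; a real run would more likely use a distance cutoff. All names and sequences are hypothetical.

```python
from collections import defaultdict


def group_identical(seqs: dict) -> dict:
    """Map each distinct (masked) sequence to the names that share it."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq].append(name)
    return groups


def names_grouped_with(seqs: dict, known_clones: set) -> list:
    """Names of non-clone sequences that fall in a group containing a known clone."""
    wanted = []
    for members in group_identical(seqs).values():
        if any(name in known_clones for name in members):
            wanted.extend(name for name in members if name not in known_clones)
    return wanted


# Toy masked sequences, invented for illustration; clone1 is the spiked-in control.
masked = {
    "clone1": "ACGT",
    "read1": "ACGT",
    "read2": "ACGA",
    "read3": "ACGT",
}
wanted = names_grouped_with(masked, known_clones={"clone1"})

# One sequence name per line, ready to be saved to a file.
accnos_text = "\n".join(wanted) + "\n"
print(accnos_text, end="")
```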

Okay, maybe this is what I will try. It will probably take some more tweaking, but I can let you know how it goes.