Add subsequence removal function to unique.seqs

I was thinking it might be useful to add a function to unique.seqs that helps remove sequences that are a subset of a longer sequence (that is, fully contained within it). Here is an example:

Seq. A - ATGCCAGACGTAGCGGATCGGATCGGTAGGGTAGCTTT
Seq. B - ATGCCAGACGTAGCGGATC

Right now, unique.seqs will keep these as two separate sequences in the .unique file, right?

It would be totally killer if mothur could:
A.) Report sequences that fit this condition
*B.) Remove the smaller sequence from the .unique file

* I realize that just reporting these sequences would be sufficient, because you could then use get.seqs or remove.seqs to remove them if you wanted to. In certain cases, you may want to remove the longer sequence instead.

What do you think? I think when used appropriately, it could be useful for cleaning up replicates within your data, or even databases.
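To make the request concrete, here is a minimal Python sketch of the "report" step (option A). It is not mothur code: the function name find_contained and the dictionary layout are made up for illustration, and it only catches exact substring matches, with no handling of mismatches or reverse complements.

```python
def find_contained(seqs):
    """Return {short_name: long_name} for every sequence that is an exact
    substring of some strictly longer sequence.

    `seqs` maps sequence names to sequence strings (hypothetical layout).
    """
    # Sort longest first so every possible container comes before
    # the sequences it might contain.
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    contained = {}
    for idx, (name, seq) in enumerate(ordered):
        for other_name, other_seq in ordered[:idx]:
            if len(other_seq) > len(seq) and seq in other_seq:
                contained[name] = other_name
                break  # one container is enough to report the sequence
    return contained


seqs = {
    "A": "ATGCCAGACGTAGCGGATCGGATCGGTAGGGTAGCTTT",
    "B": "ATGCCAGACGTAGCGGATC",
    "C": "TTTTGGGGCCCCAAAA",
}
print(find_contained(seqs))  # {'B': 'A'}
```

The names in the result could then be passed to remove.seqs. Note that this compares every sequence against every longer one, so the work grows roughly with the square of the number of sequences.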

Yeah, we’ve thought about it, but in the end you’d probably want to do an all vs. all alignment to find subsets, which would be slow. We suggest running align.seqs, removing the sequences that don’t align to the correct region, then filtering the sequences with trump=… Then run unique.seqs again.
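The idea behind that workflow can be sketched in a few lines of Python. This is only an illustration of the principle, not how mothur implements it: the helpers trump_filter and unique are hypothetical, the toy "alignment" just pads the short read, and using "." as the trump character is an assumption made for this example.

```python
def trump_filter(aligned, trump="."):
    """Drop every alignment column in which any sequence has the trump character."""
    length = len(next(iter(aligned.values())))
    keep = [col for col in range(length)
            if all(seq[col] != trump for seq in aligned.values())]
    return {name: "".join(seq[col] for col in keep)
            for name, seq in aligned.items()}


def unique(seqs):
    """Group sequence names by identical sequence string."""
    groups = {}
    for name, seq in seqs.items():
        groups.setdefault(seq, []).append(name)
    return groups


a = "ATGCCAGACGTAGCGGATCGGATCGGTAGGGTAGCTTT"
b = "ATGCCAGACGTAGCGGATC"
# Toy alignment: the shorter read is padded with "." where it has no data.
aligned = {"A": a, "B": b + "." * (len(a) - len(b))}

filtered = trump_filter(aligned)
print(unique(filtered))  # {'ATGCCAGACGTAGCGGATC': ['A', 'B']}
```

Once both reads are trimmed to the columns they share, they become identical, so an ordinary exact-match deduplication collapses them without any all vs. all comparison.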

I want to implement this code from Wikipedia:

Reading out an LCS

The following function backtraces the choices taken when computing the C table. If the last characters in the prefixes are equal, they must be in an LCS. If not, check what gave the largest LCS of keeping x_i and y_j, and make the same choice. Just choose one if they were equally long. Call the function with i=m and j=n.

function backTrace(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i = 0 or j = 0
        return ""
    else if X[i] = Y[j]
        return backTrace(C, X, Y, i-1, j-1) + X[i]
    else
        if C[i,j-1] > C[i-1,j]
            return backTrace(C, X, Y, i, j-1)
        else
            return backTrace(C, X, Y, i-1, j)
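For reference, a runnable Python version of the same idea might look like the sketch below. The names lcs_table and backtrace are my own, and the table-building step is included because the pseudocode assumes the C table already exists. Python strings are 0-indexed, so X[i] in the pseudocode becomes x[i - 1] here.

```python
def lcs_table(x, y):
    """Build the table c, where c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def backtrace(c, x, y, i, j):
    """Read one LCS out of the table, starting from i = len(x), j = len(y)."""
    if i == 0 or j == 0:
        return ""
    if x[i - 1] == y[j - 1]:
        # Matching last characters belong to the LCS.
        return backtrace(c, x, y, i - 1, j - 1) + x[i - 1]
    if c[i][j - 1] > c[i - 1][j]:
        return backtrace(c, x, y, i, j - 1)
    return backtrace(c, x, y, i - 1, j)


a = "ATGCCAGACGTAGCGGATCGGATCGGTAGGGTAGCTTT"
b = "ATGCCAGACGTAGCGGATC"
c = lcs_table(a, b)
print(backtrace(c, a, b, len(a), len(b)) == b)  # True: all of b appears, in order, in a
```

The recursion goes up to len(x) + len(y) levels deep, so for long reads this would hit Python's default recursion limit and a loop-based backtrace would be the safer choice. Also keep in mind that an LCS allows gaps between matched characters, so an LCS equal to the shorter sequence does not by itself prove it is a contiguous substring of the longer one.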

But I received an error on the last line… can you tell me why?

You might check out the brief discussion on the topic…