screen.seqs: Where does it start and end?

Hi there,

I just recently started using mothur and I am trying to learn and understand the program and its commands…Please let me know if you see any mistakes or if you have any input…that would be extremely helpful.
I have a question regarding the screen.seqs command. I have been playing around with the parameters and came across an issue I would like to have your opinion about.

  • Here is the summary of my data:
    mothur > summary.seqs(fasta=VS150110.unique.align)

Start End NBases Ambigs Polymer
Minimum: 1039 6330 20 0 3
2.5%-tile: 1125 6333 207 0 3
25%-tile: 1461 6333 218 0 4
Median: 1789 6333 226 0 4
75%-tile: 1793 6333 237 0 5
97.5%-tile: 2053 6334 249 1 6
Maximum: 43046 43116 282 7 31

of Seqs: 8366

  • Most sequences stop at 6333, but, unfortunately, the sequences don’t start at the same position. There are also some sequences starting on the hight end (see maximum values). Because I don’t have a common start, I decided to use the end-parameter of 6333 to remove sequences that don’t stop at this position:

mothur > screen.seqs(fasta=VS150110.unique.align, minlength=50, maxhomop=8, end=6333)
mothur > summary.seqs(fasta=VS150110.unique.good.align)

Start End NBases Ambigs Polymer
Minimum: 1039 6333 200 0 3
2.5%-tile: 1125 6333 207 0 3
25%-tile: 1461 6333 218 0 4
Median: 1789 6333 226 0 4
75%-tile: 1793 6333 237 0 5
97.5%-tile: 2053 6334 249 1 6
Maximum: 37810 42586 282 7 8

of Seqs: 8349

  • Unfortunately, it didn’t cut out the sequences on the high end and it is probably because the are starting after the end parameter was set (after 6333). So I though maybe it would be possible to remove them by setting the start parameter over all the sequences I would like to keep (start=6335).

mothur > screen.seqs(fasta=VS150110.unique.align, minlength=50, maxhomop=8, start=6335)
mothur > summary.seqs(fasta=VS150110.unique.good.align)

Start End NBases Ambigs Polymer
Minimum: 1039 6330 200 0 3
2.5%-tile: 1125 6333 207 0 3
25%-tile: 1461 6333 218 0 4
Median: 1789 6333 226 0 4
75%-tile: 1793 6333 237 0 5
97.5%-tile: 2051 6334 249 1 6
Maximum: 4961 6386 282 7 8

of Seqs: 8342

mothur > summary.seqs(fasta=VS150110.unique.bad.align)

Start End NBases Ambigs Polymer
Minimum: 1461 6333 20 0 3
2.5%-tile: 1461 6333 20 0 3
25%-tile: 32525 40321 218 0 4
Median: 32525 40330 221 0 5
75%-tile: 37705 42559 233 0 7
97.5%-tile: 43046 43116 262 2 31
Maximum: 43046 43116 262 2 31

of Seqs: 24

Although it worked, I am not sure if it is okay to use the screen,seqs command this way (especially because the start parameter is meant to be used in a different way). Is it acceptable or is it cheating???
Why does the parameter start=6335 ignore the sequences below 6335 (and end=6333 sequences above 6333)?..Wouldn’t it theoretically remove all the sequences that don’t start at 6335?
Is it a bad alignment if all the sequences don’t start at the same position?

Any input would be great. Please let me know if you have questions.

Cheers,
Verena

Hi Verena,

Try this: filter.seqs(fasta=VS150110.unique.align, trump=.)

This should remove any column which contains a “.” even if there are nucleotides in columns above and below (i.e. this will remove columns in the staggered beginning sequence areas, but you will still keep the rest of the sequence). This will also remove columns containing all “.” or “-” which is good because it speeds up building the distance matrix. This should also take care of the rogue sequence at the high end of the alignment.

Using start= and stop= in screen.seqs is useful only if you want to throw out whole sequences that don’t match your criteria.

Let me know if this does what you’re trying to do.

Ryan

Hi Ryan,

Thanks for your input. I appreciate it!

I run the alignment through filter.seqs, but, unfortunately, it removed all the columns…hmmmm…

mothur > filter.seqs(fasta=VS150110.unique.align, trump=.)

Length of filtered alignment: 0
Number of columns removed: 50000
Length of the original alignment: 50000
Number of sequences used to construct filter: 8366

I assume that means that the alignment is not the greatest (well, there isn’t one left). Any ideas on how I can improve this situation?..help…

Cheers,
Verena

Hi Verena,

The problem comes back to those sequences that were at the high end of the alignment that didn’t overlap with the rest of your seqs. (i.e. some of the 24 sequences you removed in your VS150110.unique.bad.align file of your first post). So try this:

screen.seqs(fasta=VS150110.unique.align, start=2000)
#This will remove offending sequences that don’t start before the 2000th column. You could use a lower number too, like 1800, but your going to loose some sequences. It is the infamous battle of preserving quantity of sequences vs. quantity of sequence overlap for analysis :slight_smile:

now do what I suggested previously:

filter.seqs(fasta=VS150110.unique.good.align, trump=.)

Let me know if it works.

Cheers,

Ryan

Hi Ryan,

Thanks…that helped.

I think I got confused by the start parameter in screen.seqs…I thought that when using e.g. start=2000 that it would remove columns before the 2000th column (because the start is at 2000) rather than removing sequences that don’t start before 2000…if that makes sense. Anyhow, it did remove the sequences at the higher end.

Here is the outcome from the filter:

mothur > filter.seqs(fasta=VS150110.unique.align, trump=.)

Length of filtered alignment: 557
Number of columns removed: 49443
Length of the original alignment: 50000
Number of sequences used to construct filter: 8341

Seems like I get to keep part of my alignment this time! :smiley:

Thanks for your help Ryan!

Cheers,
Verena

When do you know which one to choose over the other?..What’s your opinion?

Ahh, I think this is personal to each dataset, especially because there is no way of knowing exactly which is “better” or more correct.

The benefit of opting towards more sequence overlap (at the expense of removing seqs) is to build a distance matrix made from a larger section of the 16s for the OTU’s present in the analysis. This should, in theory, better resemble the distance matrix created from the full 16s gene.

The benefit of choosing more sequences (at the expense of loosing shared alignment columns) should, in theory, better resemble the membership and structure of the community of organisms you are trying to study.

There is definitely a sweet spot somewhere in the middle. Just taking a look at the avg. number of ambiguities in your sequences and your # of seqs, I would tend to lean towards more sequence overlap in your case.

Does this make sense? Any one else care to chime in about this?

Cheers,

Ryan

My $0.02…

I would opt for more sequences over length. Personally, I typically select the values at the 2.5 and 97.5% levels so that I can keep about 95% of the sequences. To me length gets you better phylogenetic signal. In my contrarian way, I would say that the difference between 200 and 250 or 200 and even 300 is marginal in this regard. On the other hand having more sequences will boost your statistical power to do things like compare communities, get at richness, etc.

A word of warning - we’re noticing that the genome centers hate this logic because they enjoy being able to report how many bases they generate per $. If you then tell them that you’re either throwing out sequences or bases they start to rock nervously in their chairs.

That’s exactly what I do, Pat. go for the 95%. You can report that as a method that isn’t entirely ad hoc.

If I have ANY doubt about the quality of a sequence, I throw it out. That makes my results true, even if I miss something else.

I think all of you make a valid point.

I guess, in the end what it really comes down to is the question you are asking. In my case I am very interested in the difference of my samples to each other and therefore am very hesitant to throw out any sequence (that potentially might be important). However, I also want to make sure to the quality of the alignment is efficient enough to give me the “right” answer (with a significance).

Granted, one could say what are 100 sequences out of 8000 unique sequences (all together 45000 sequences) that might not have a big impact. Nevertheless, it does matter when one unique sequence to be removed has 15 more identical sequences stored away in the names list (that subsequently would be removed to). To me that means that this sequence is real, not a sequencing error and should be included, if possible.

Here is another question to think about: how much does the diversity of the sample (sequences) influence the alignment? The more diverse sequences are (in the variable region), the less they might “fit together”, which in return would influence the alignment (alignment overlap and length) and consequently what sequences to let go and which to keep. What do you think?
Additionally, one would probably get a better alignment from 500 sequences versus 5000.

For me it is very difficult to make a call when to throw out data/sequences and feel comfortable with doing so….because you never know what you might miss….but, consequently, you’ll also never find out.

Cheers,
Verena

By the way, I also figured out why some of the sequences started at column 37000 in the example above. Turns out that those sequences were still reversed and not “flipped around” in the alignment although flip=t was “on”. Making the gapopen parameter more stringent (gapopen=-2.25) took care of that issue….now they are part of the alignment and didn’t have to be removed.

Here is another question to think about: how much does the diversity of the sample (sequences) influence the alignment? The more diverse sequences are (in the variable region), the less they might “fit together”, which in return would influence the alignment (alignment overlap and length) and consequently what sequences to let go and which to keep. What do you think?
Additionally, one would probably get a better alignment from 500 sequences versus 5000.

Good questions - actually, this isn’t a problem for mothur’s aligner. It takes each unaligned sequence and aligns it to the best match in the alignment database. Since the reference alignment is fixed, other sequences won’t affect the alignment quality. Also, having more or fewer sequences won’t affect the quality. If you are an unfortunate soul trying to use a program like clustal or muscle, we do know that those methods will inflate the distances between sequences as the number of sequences increases. So, as long as your reference alignment is good (i.e. using silva), you can be confident that you will get a good alignment out of align.seqs regardless of length or number of sequences.

…ahhh…that would also explain why it is okay to remove sequences with screen.seqs after they were aligned (instead of having to do it beforehand).