filter.seqs

  1. I used the command filter.seqs(fasta=*.fasta, vertical=T, trump=.) after Chimera.slayer and Remove.seqs steps, the result shows as below:

Length of filtered alignment: 1 Number of columns removed: 3829 Length of the original alignment: 3830 Number of sequences used to construct filter: 67014

Output File Names:
Batch_5_soil.filter
Batch_5_soil.trim.unique.pick.good.filter.pick.filter.fasta

summary seqs above:
Start End NBases Ambigs Polymer
Minimum: 0 0 0 0 1
2.5%-tile: 1 1 1 0 1
25%-tile: 1 1 1 0 1
Median: 1 1 1 0 1
75%-tile: 1 1 1 0 1
97.5%-tile: 1 1 1 0 1
Maximum: 1 1 1 0 1

of unique seqs: 67014

total # of seqs: 77803

  1. if I use the filter.seqs (fasta=*.fasta, vertical=T) without trump=., the result is:

Length of filtered alignment: 1819
Number of columns removed: 2011
Length of the original alignment: 3830
Number of sequences used to construct filter: 67014

Output File Names:
Batch_5_soil.filter
Batch_5_soil.trim.unique.pick.good.filter.pick.filter.fasta

summary.seqs above

Start End NBases Ambigs Polymer
Minimum: 1 844 250 0 3
2.5%-tile: 1 1133 252 0 4
25%-tile: 1 1179 280 0 4
Median: 1 1398 385 0 5
75%-tile: 1 1631 419 0 5
97.5%-tile: 1 1794 494 0 6
Maximum: 844 1819 535 0 8

of unique seqs: 67014

total # of seqs: 77803

The question is what the Trump=., does? why does it make such big difference?

Many thanks,
Hui

trump=. will remove any column that has missing data. What you see is occurring because your sequences don’t fully overlap. You need to use screen.seqs as outlined in the Costello example to make sure that you sequences all have a similar start or stop point and then use filter.seqs. If you don’t use trump=. you run the risk of comparing apples to oranges since the 16S gene does not evolve at a uniform rate over its length.

Pat

Thanks Pat.

I re-analized the data using screen.seqs (…minlength=250, start=1044) followed by filter.seqs(…Vertical=T), chemiera.slayer(…), remove.seqs(accnos=…), and filter.seqs(fasta=, vertical=T, trump=.)

after chemira.slayer and remove.seqs, the summary.seqs look like:

Start End NBases Ambigs Polymer
Minimum: 1 840 250 0 3
2.5%-tile: 1 1151 252 0 4
25%-tile: 1 1198 280 0 4
Median: 1 1409 385 0 5
75%-tile: 1 1655 419 0 5
97.5%-tile: 1 1829 494 0 6
Maximum: 1 1866 535 0 8

of unique seqs: 66267

total # of seqs: 77060

but after filter.seqs(…vertical=T, trump=.), the summary.seqs became


Start End NBases Ambigs Polymer Minimum: 1 771 102 0 3 2.5%-tile: 1 771 149 0 3 25%-tile: 1 771 154 0 4 Median: 1 771 163 0 4 75%-tile: 1 771 170 0 5 97.5%-tile: 1 771 187 0 6 Maximum: 1 771 304 0 8 # of unique seqs: 66267 total # of seqs: 77060

Although the No of sequence did not change after filter.seqs(…trimp=.), but the base was significantly reduced from 250-535 to 102-301, almost half of the them was moved!! Does it make sense? or something wrong with align.seqs as you mentioned in one question about filter.seqs.

Thanks,

Hui

You could try something like the trim.seqs option 2 where you set the start position and optimize on the end position. Have you looked at the alignment to make sure it makes sense? Do you have the identity of the long sequence?

Hei, Pat, thanks a lot. The result seems ok now:-)

The filtered base (…vertical=T, trump=.) is from 191-370 after using screen.seqs (…minlength=250, start=1044, optimize=end). I also checked the alignment. It looks pretty ok too. Just wondering the defaut of screen.seqs command does not optimize the end even you set up the minlength and start position?

Thanks, Hui

That’s correct - optimize is not a default. Glad it’s working now…
Pat

Hei, Pat,

I am processing OTU-based analysis following the Costello stool analysis. The cluster (phylisp=, name=, cutoff=0.10) command after dist.seqss(fasta=, output=lt, cutoff=0.10) always changed the cutoff to 0.03 in the result, which affects the downstream processing for cutoff with 0.05 and 0.10. Do you have suggestion or advice for that?

thanks

Hui

Try using a cutoff of 0.20 or 0.25

Hei, Pat,

I have been running the Cluster after dist.seqs. But the process took too long time to be finished as the file is too huge(it is 27G after dist.seqs). I decided to split the final four files (final.fasta, final.groups, final.names, final.rdp.taxonmy) into 2 batches before dist.seqs and cluster. I could use split.group and merge.files commands to split final.fasta file, use get.groups to split final.group files. Do you have any suggestions to split final.names used for cluster step and split final.rdp.taxonomy ? As I do not want to start from beginning of the pre-process of sequences. Many thanks,

Hui