I used the command filter.seqs(fasta=*.fasta, vertical=T, trump=.) after Chimera.slayer and Remove.seqs steps, the result shows as below:
Length of filtered alignment: 1
Number of columns removed: 3829
Length of the original alignment: 3830
Number of sequences used to construct filter: 67014
if I use the filter.seqs (fasta=*.fasta, vertical=T) without trump=., the result is:
Length of filtered alignment: 1819
Number of columns removed: 2011
Length of the original alignment: 3830
Number of sequences used to construct filter: 67014
trump=. will remove any column that has missing data. What you see is occurring because your sequences don’t fully overlap. You need to use screen.seqs as outlined in the Costello example to make sure that you sequences all have a similar start or stop point and then use filter.seqs. If you don’t use trump=. you run the risk of comparing apples to oranges since the 16S gene does not evolve at a uniform rate over its length.
I re-analized the data using screen.seqs (…minlength=250, start=1044) followed by filter.seqs(…Vertical=T), chemiera.slayer(…), remove.seqs(accnos=…), and filter.seqs(fasta=, vertical=T, trump=.)
after chemira.slayer and remove.seqs, the summary.seqs look like:
Although the No of sequence did not change after filter.seqs(…trimp=.), but the base was significantly reduced from 250-535 to 102-301, almost half of the them was moved!! Does it make sense? or something wrong with align.seqs as you mentioned in one question about filter.seqs.
You could try something like the trim.seqs option 2 where you set the start position and optimize on the end position. Have you looked at the alignment to make sure it makes sense? Do you have the identity of the long sequence?
Hei, Pat, thanks a lot. The result seems ok now:-)
The filtered base (…vertical=T, trump=.) is from 191-370 after using screen.seqs (…minlength=250, start=1044, optimize=end). I also checked the alignment. It looks pretty ok too. Just wondering the defaut of screen.seqs command does not optimize the end even you set up the minlength and start position?
I am processing OTU-based analysis following the Costello stool analysis. The cluster (phylisp=, name=, cutoff=0.10) command after dist.seqss(fasta=, output=lt, cutoff=0.10) always changed the cutoff to 0.03 in the result, which affects the downstream processing for cutoff with 0.05 and 0.10. Do you have suggestion or advice for that?
I have been running the Cluster after dist.seqs. But the process took too long time to be finished as the file is too huge(it is 27G after dist.seqs). I decided to split the final four files (final.fasta, final.groups, final.names, final.rdp.taxonmy) into 2 batches before dist.seqs and cluster. I could use split.group and merge.files commands to split final.fasta file, use get.groups to split final.group files. Do you have any suggestions to split final.names used for cluster step and split final.rdp.taxonomy ? As I do not want to start from beginning of the pre-process of sequences. Many thanks,