mothur

Need to separate Bacteria, Archaea and Eukarya sequences

Hello mothur helpers,

We are looking at all Bacteria, Archaea and Eukarya sequences in 11 samples (see below for primers/regions used) and it was cheaper and faster for our sequencing facility to do all three domains at the same time per sample. Hence we have a mixture of sequences from a mixture of amplicons. We are following the MiSeq SOP and using silva.align database. We couldn’t do the pcr.seqs step since we have different sequences all together. And now, after the alignment -align.seqs-, we are running into the same problem with screen.seqs. Our questions is the following (though we will use any advise and read any -polite- thoughts on anything about this!): we are wondering if maybe we should use the align.seqs step to separate our Bacteria from our Archaea and from our Eukarya; and do all the analyses from this step on with a domain at a time. Is there any way to extract the Bacteria-only sequences from silva.align? Our primers are shown below. We will be using the V4-V5 region in the future, but these are the ones we had at the time/knew about when we started this project. Thank you so much for your insights, a.

Bacteria:
1100F:YAACGAGCGCAACCC
1492R:GGGTTACCTTGTTACGACTT

Archaea:
344F:ACGGGGYGCAGCAGGCGCGA
915R:GTGCTCCCCCGCCAATTCCT

Eukarya:
516F:GGAGGGCAAGTGTGGT
1055R:CGGCCATGCACCACC

What is your issue with the align step? I think I’d have all the seqs together through alignment, then use screen.seqs after alignment to separate them. You’ll need to know the coordinants for each primer set, then use that as the start and end for each set. So you’ll run screen.seqs for the bac coords, save the fasta and count as bac, then run screen.seqs again on the original align for arch, …

Thank you so much. We will try that. Here is below our summary.seqs right after align.seqs. I am imagining that all of the sequences beginning at 35141 and ending at 43116 are Bacteria sequences; a bit puzzled about 43107 and ending at 43116, but maybe just taking the Bacteria ones out -almost certainly the most numerous- will help sort the other ones out. Thank you again very much. We are excited to get to the actual analysis!

mothur > summary.seqs(fasta=stability.paired.trim.contigs.good.unique.align, count=stability.paired.trim.contigs.good.count_table)

Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1046 1060 4 0 1 159937
25%-tile: 35141 43116 369 0 4 1599364
Median: 35141 43116 391 0 5 3198727
75%-tile: 35141 43116 395 0 5 4798090
97.5%-tile: 43107 43116 490 0 11 6237516
Maximum: 43116 43117 502 0 251 6397452
Mean: 31376 38415 322 0 4
# of unique seqs: 4576940
total # of seqs: 6397452

It took 10761 secs to summarize 6397452 sequences.

Sometimes random junk gets amplified and those sequences don’t align well. I think you want to use start=35141, end=43116, and maxhomop=8

Pat

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.