18S reference alignment (and other bugs)

Hi all

I have been trying to analyze 18S amplicon sequences using mothur. The primers used to generate the amplicon are 1391F and EukBr which amplify a ~360bp amplicon with a partial overlap of 143bp.
I initially ran a test with only 4 samples to see how it will go with the miseq sop. When I check the sequences before aligning them I get the following:

	Start	End	NBases	Ambigs	Polymer	NumSeqs
	Minimum:	1	262	262	0	4	1
	2.5%-tile:	1	326	326	0	6	225
	25%-tile:	1	326	326	0	10	2241
	Median: 	1	351	351	0	11	4482
	75%-tile:	1	357	357	0	12	6723
	97.5%-tile:	1	367	367	0	15	8739
	Maximum:	1	418	418	0	34	8963
	Mean:	1	346.909	346.909	0	11.1019
	# of unique seqs:	8936
	total # of seqs:	8963

I then align (to the entire silva alignment as I didnt know the position) using

$mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.nr_v138.align, processors=8)

and then on the summary seqs:

	Start	End	NBases	Ambigs	Polymer	NumSeqs
	Minimum:	0	0	0	0	1	1
	2.5%-tile:	1044	1048	2	0	1	224
	25%-tile:	40850	43116	207	0	4	2235
	Median: 	40868	43116	207	0	7	4469
	75%-tile:	41020	43116	207	0	10	6703
	97.5%-tile:	43113	43116	208	0	12	8713
	Maximum:	43116	43116	418	0	18	8936
	Mean:	36370	38191.2	166.906	0	6.8731
	# of Seqs:	8936

It seems like apx 150bp of sequences have dissapeared. I pulled out some sequences of the aligned file (.trim.contigs.good.unique.align) and some sequences before aligning (.trim.contigs.good.unique.fasta) and when i align them to each other it seems that only the region covering the reverse read has been retained (which includes the partial overlap with the forward read) reducing the sequences to 207bp!

I then when to the silva aligner Alignment, Classification and Tree Service and uploaded some of the sequences that I ve been trying to align (a subset of .trim.contigs.good.unique.fasta) which suggested that the position in the silva alignment should be between 40700 and 43391. I then trimmed the alignment to the suggested positions

$pcr.seqs(fasta=/home/panos/Audrey/test/silva.nr_v138.align, start=40700, end=43391, keepdots=F)

and it generates a file of 0 bytes!

Why is this happening? Is this a bug? I checked the primers

1391F
GTACACACCGCCCGTC
EukBr
TGATCCTTCTGCAGGTTCACCTAC

and they are definitely both in the 18S

thanks in advance

Happy Friday,
P

ps: here are some of the sequences (.trim.contigs.good.unique.fasta) if someone wants to try and reproduce the error

M00626_352_000000000-B9K95_1_2105_20306_3743
TGTTGTGACATTATAATTCATACATGATAGGACTATACGCGCGTCGGACATTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTGACGACATGGTTCTACAGTACACACCGCCCGTCGCTCCTACCGATACCGGGTGATCCGGTGAACCTTTTGGACTCGTAAGGGGAAGATAAGTAAACCATATCACCTAGAGGAAGGAGAAGTCGTAACAAGGTTTCCGTAGGTGAACCAGCAGAAGGATCAAGACCAAGTCTCTGCTACCGTATGCGAGACGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGATCTCTCATCTATTATCCACCTGTTTACCACTGATTGTA
M00626_352_000000000-B9K95_1_2105_15752_4492
GCATCTCTCTTTCTCCTGATCTTTTGTCACTAACACTACTACCTGTGTCTTTTTTTTTTGAATGATACGGCGACCACCGAGATCTACACTGACGACATGGTTCTACAGTACACACCGCCCGTCGCTCCTACCGATACCGGGTGATCCGGTGAACCTTTTGGACTCGTAAGGGGAAGATAAGTAAACCATATCACCTAGAGGAAGGAGAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCAAGACCAAGTCTCTGCTACCGTATGCGAGACGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAAATTGATTGTATGGCACTATAGCGAAGATGAGGCAAAT
M00626_352_000000000-B9K95_1_2105_20510_4544
CACAGTGTTCAATCTAACATACTATGCAATCGTGTCTCCTGACCACTTGGTATTTTTTTTAATGATACGGCGACCACCGAGATCTACACTGACGACATGGTTCTACAGCACACACCGCCCGTCGCTCCTACCGATACCGGGTGATCCGGTGAACCTTTTGGACTCGTAAGGGGAAGATAAGTAAACCATATCACCTAGAGGAAGGAGAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCAAGACCAAGTCTCTGCTACCGTATGCGAGACGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAATAAATAATGTTATTAGATCGCCAAGACAACACAATCCAA

Hi,

The problem is with the reference alignment. Because the 18S reference sequences don’t all span the full length of the gene, there is variation in where the sequences start and end. You might have a sequence that matches one reference sequence well, but when align.seqs aligns your sequence to that reference it chops off a lot of your sequence because the reference doesn’t have the full length covered by your sequence.

With 16S, our references align the same coordinates going from 27 to 1492. We couldn’t do this with 18S because the coverage of the gene is so varied that we would end up removing a lot of sequences if we picked coordinates to insure were represented by every sequence. In the READMEs, we recommend customizing the alignment to make sure that it overlaps with the region you are interested in. I would encourage you to take a full length reference that is aligned and then align one of your sequences to that reference to get the actual coordinates. Then you can align your full dataset to that reference.

Hope this helps a bit…
Pat

1 Like

OK cool, thats what I thought…
Well, that doesnt leave me much choice but to make my own alignment. I tried the PR2 which seems that it covers fully the region that R1 and R2 overlap but since it is unaligned I have to make my own alignment… :frowning:
The unique sequences in PR2 are apx 150k so I do not have the capabilities of building such an alignment de novo. I have now picked only the Alveolata which are 35k (since all my sequences fall within them) and I am trying to see if I can make alignments with mafft and muscle. If I succeed, I d be happy to upload the alignment in a repository that other people can use too.
Happy Sunday everyone

1 Like

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.