Problem with the start position after align.seqs

Hi, I’m new to using mothur and, actually, analyzing sequence data. As I’ve read a million times in this forum it is clearly a mistake to work with the V3-V4 region; unfortunately, it was too late to learn this when I got this data. So, I analyzed it by following the manual and forum, but, I have a problem with these ridiculously different start positions after alignment.

This is the situation after make.contigs:

                 Start	 End	  NBases	 Ambigs   Polymer   NumSeqs

Minimum: 1 250 250 0 3 1
2.5%-tile: 1 274 274 0 4 143919
25%-tile: 1 298 298 0 4 1439188
Median: 1 442 442 0 5 2878375
75%-tile: 1 465 465 0 6 4317562
97.5%-tile: 1 466 466 2 31 5612831
Maximum: 1 500 500 75 250 5756749
Mean: 1 401.079 401.079 0.221197 7.30846

of Seqs: 5756749

After that, I run screen.seqs :

screen.seqs(fasta=stability.trim.contigs.fasta,minlength=10,maxlength=470,maxambig=0,maxhomop=8)

	      Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 250 250 0 3 1
2.5%-tile: 1 274 274 0 4 125262
25%-tile: 1 298 298 0 4 1252620
Median: 1 441 441 0 5 2505240
75%-tile: 1 465 465 0 6 3757860
97.5%-tile: 1 466 466 0 6 4885218
Maximum: 1 470 470 0 8 5010479
Mean: 1 398.46 398.46 0 4.82017

of Seqs: 5010479

Then, I run unique.seqs, create a count table, and run align.seqs with the tailored reference fasta that I produced from “silva.bacteria.fasta” by trimming it at the start position = 3388 and the end position= 25316.

And, I got this:

	        Start	End	        NBases	Ambigs	Polymer	NumSeqs

Minimum: 0 0 0 0 1 1
2.5%-tile: 1 18929 7 0 2 125262
25%-tile: 1 18929 298 0 4 1252620
Median: 1 18929 441 0 5 2505240
75%-tile: 3979 18929 465 0 6 3757860
97.5%-tile: 8578 18929 466 0 6 4885218
Maximum: 18929 18929 470 0 8 5010479
Mean: 1874.55 18718 387.523 0 4.70551

of unique seqs: 2887389

total # of seqs: 5010479

Well, as I understand it is not how it should be, and I do not know how I can fix this. What should I do?

Thanks for all your time and help in advance.

E.T.

Update:

I tried to set “minlength” parameter of screen.seqs to 390 bp and run it again. After, I followed the manual as I did before. The result I’ve got confused my mind even more. This is the summary of align.seqs:

	Start	End	NBases	Ambigs	Polymer	NumSeqs

Minimum: 1 14 3 0 1 1
2.5%-tile: 1 18928 439 0 4 80931
25%-tile: 1 18928 441 0 4 809302
Median: 1 18928 459 0 5 1618603
75%-tile: 1 18928 464 0 6 2427904
97.5%-tile: 1 18928 465 0 6 3156275
Maximum: 18910 18928 469 0 8 3237205
Mean: 26.8008 18798.8 450.293 0 4.92839

of unique seqs: 2048792

total # of seqs: 3237205

As I search for it but I could not find any explanation. I used the same reference fasta as I used before. What does it mean?

Thanks for your time and help again.

E.T.

For what you have here, your core reads are between 1 and 18928. All looks OK.

Are you aligning to the whole 18S, or to a reference with just V3V4. It might help with the alignment… But, as you have it now, you have a bit over 3M reads, 2M unique, with start=1 and end=18928.

Leo

Thanks for your reply.

I am aligning illumina 16s v3v4 region data (screen.seqs was applied on the raw data with the parameters: maxhomop=8, maxambig=0, minlength=390 and maxlength= 470 to produce it) by using the reference fasta which I produced by trimming “silva.bacteria.fasta” file at start position=3388 and end position=25316 for the coverage 16s v3v4 region.

I really do not know whether it is normal to get a start position “1” after alignment or if it is, what it means.

I have strong feelings that I did something wrong because no other example is present to get a start position at “1” after align.seqs.

Did I do something wrong if it is, what is it?

Thanks for your time.

After a week, I think I understand the situation. Sorry for taking your time.

If I am correct, this is the story.

The first problem I originally asked for help with was the start positions of the reads after alignment. Some of them were starting on “1” and some of them were starting on “3979” and “8578”.

Suddenly, I realized that I was working with the 16s V3-V4 region, and the reads in the dataset are supposed to be ~ 460 bp. So, I run the screen.seqs command on the reads not only by the max length but also by the minlength and eliminated the reads shorter than 390 bp. When I aligned the sequences, all the sequences started at position “1”.

The second question I was asking was related to the sequences starting at position “1”. Because this is the first time I am dealing with bioinformatics, and I had never seen the sequences that started at the position “1” in the tutorials and examples I found, I thought I had done something wrong.

After days, I could finally think that it may be helpful to investigate the fastq files used for the examples and I saw that my sequences have primer sequences at the beginning and the sequences in the example fastq have not. So possibly they are aligning by beginning at the first position as they should be, there is no problem, just me being an anxious newbie.

Thanks for your help.

1 Like

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.