Choosing start and end for the pcr.seqs command

From the MiSeq SOP, I know it’s good to customize silva.bacteria.fasta to our own sequencing region by using the pcr.seqs command.
Would you mind telling me how to decide the start and end points? I saw some examples that use oligos files, but that doesn’t work since the SILVA database sequences don’t contain the primer sequences unless they come from genomes. You then told them to use start=1044, end=13127. Are these the same positions for all other 16S rDNA amplified with primers?

Thanks a lot.

The 27f (aka 8f) and 1492r primers are not included in the database. The reference alignments go from 1044 to 43116. You should take something like E. coli’s 16S rRNA gene sequence and trim it to the primers you used. Then align the trimmed sequence with align.seqs against the reference. Then use the output of that as input to summary.seqs. That will give you the proper coordinates.
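As a rough illustration of what the start and end coordinates mean, here is a minimal Python sketch. It assumes the coordinates are the 1-based alignment columns of the first and last base in an aligned sequence, with `.` and `-` treated as gaps. The function name and the toy sequence are hypothetical; a real row in the reference alignment is tens of thousands of columns wide.

```python
def alignment_coordinates(aligned_seq, gap_chars=".-"):
    """Return (start, end): 1-based columns of the first and last non-gap base."""
    positions = [
        col for col, ch in enumerate(aligned_seq, start=1)
        if ch not in gap_chars
    ]
    if not positions:
        raise ValueError("sequence contains only gap characters")
    return positions[0], positions[-1]


# Toy aligned sequence (hypothetical), standing in for the trimmed
# E. coli sequence after it has been aligned against the reference.
toy = "....--AC-GT--A..."
start, end = alignment_coordinates(toy)
print(start, end)
```

The pair returned for the aligned, primer-trimmed sequence is what you would then pass as start and end.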

Also, I’d be very hard pressed to pick anything other than the V4 region at this point…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Pat

I just tried start=10000 and end=45000. It took several hours. When I then ran summary.seqs, the output indicated start=10357, end=28464 (515F/907R primers were used). pre.cluster and chimera.uchime can only use 1 processor on this data, so they take a really long time to run. So far it has taken almost 10 hours but is only 27% finished, so I think something is wrong here. Also, almost 24% chimeras have been found so far. :frowning:

Judging from my last attempt, it will take more than 80 hours to finish the chimera.uchime command.

If you have a bunch of reads and only one sample, all of those numbers and experiences sound right. You really should read this to better understand what’s going on - your data are likely pretty crummy because the reads don’t overlap…

http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/

Pat

I have 25 samples amplified with universal primers but with different barcodes to distinguish the samples.
After reading your link, my understanding is that the mismatches at the two ends, from the forward and reverse primers, will exaggerate the number of chimeras. Am I right? (I was told that removing the chimeras might have less impact on the relative abundance, :lol: lucky)
To get good quality data, is it advised that we sequence a shorter stretch of the 16S rRNA gene on current Illumina platforms?
Is it possible to remove the replicates first (recording the number removed for later assessment), and then make contigs?



The only thing I suggest people use at this point is the v2 chemistry with the 500 cycle kit (2x250 nt) to sequence the V4 region. Anything else is just asking for problems.

Pat

Our research group sequences V4 and V5, which is about 400 bp. It seems to be OK then. At first I didn’t use the oligos file in the make.contigs command, which was wrong. Then the step that assesses the error rate with the get.groups and seq.error commands can be tried using our own data, since we don’t have Mock groups. And I wonder, could we still follow the SOP through dist.seqs and cluster.split and the later analysis? :o

You can carry the SOP through like we do for the V4 but for your V4V5 data. But what I’m telling you is that your error rate is going to be very high. If you’re happy with low quality data, go for it :).
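To make the overlap concern concrete, here is a small Python sketch of the arithmetic. It uses the numbers mentioned in this thread (2x250 nt reads and a fragment of about 400 bp) plus a hypothetical 250 bp fragment for contrast. The function name is an assumption, and the calculation ignores primers, barcodes, and quality trimming.

```python
def paired_read_overlap(amplicon_len, read_len):
    """Return (overlap, fraction) for a fragment sequenced from both ends.

    overlap  = number of bases covered by both the forward and reverse read
    fraction = overlap divided by the fragment length
    """
    if amplicon_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    # Bases read twice; cannot be negative or exceed the fragment itself.
    overlap = max(0, min(amplicon_len, 2 * read_len - amplicon_len))
    return overlap, overlap / amplicon_len


# ~400 bp V4-V5 fragment with 2x250 nt reads (numbers from this thread)
print(paired_read_overlap(400, 250))
# Hypothetical 250 bp fragment with the same reads, for contrast
print(paired_read_overlap(250, 250))
```

Under these assumptions, only a quarter of the 400 bp fragment is covered by both reads, so most positions rely on a single read and cannot be corrected by the other one.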

Yes, the longer the sequences, the higher the error rate. Some parameters in the MiSeq SOP may not fit personal data.

I used silva.119v directly for classify.seqs, which may also be wrong. I should use the RDP reference files as the reference.

By the way, if mothur shows some errors, then “logout”, and the process completes, does that mean I should retry later?




I get it that we can find the start/end columns by munging a particular reference gene (such as E. coli 16S) for our particular primers. BUT…at this point there are a handful of “standard” primer pairs. Surely someone has already done the groundwork.

Pat says above that 27f (aka 8f) and 1492r primers go from 1044 to 43116. One can infer from the SOP that the V4 region is 11894 to 25319.

Is there a table somewhere saying what the start/end pairs are for common primer pairs, or “standard” named regions?
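For anyone unsure what a start/end pair does to the reference, here is a minimal Python sketch of trimming an aligned sequence to a column range. It assumes start and end are 1-based, inclusive alignment columns. The function name and toy sequence are hypothetical, and the real pcr.seqs command has further options (such as keepdots) that are not modeled here.

```python
def trim_to_columns(aligned_seq, start, end):
    """Keep only alignment columns start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(aligned_seq):
        raise ValueError("start/end must lie within the alignment")
    return aligned_seq[start - 1:end]


# Toy 12-column alignment row (hypothetical); a real pair such as
# 11894 and 25319 would be applied to the full-width alignment.
row = "..AC-GT-AA.."
print(trim_to_columns(row, 3, 9))
```

Every sequence in the reference is cut at the same two columns, which is why a single coordinate pair can describe a region for the whole alignment.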

I still think it is good to use a wider start and end. I tried several times: first with a relatively wide range, then I ran summary.seqs and figured out the start and end positions, but narrowing the range on the next run with the same data didn’t work for me. Choosing a wider range between start and end doesn’t take long. The start and end positions above are not OK for V4 and V5 with my data, so I am still trying a wider range.
By the way, I don’t know how to use the E. coli data as a reference alignment.
Also, you can try 11894 and 25319, then run summary.seqs to see whether the length fits your primers.

Not that I know of. This is also part of my scheme to reduce the number of times that I have to post that blog post when people try things other than V4 :).

Pat

Can any E. coli sequence be used as the reference to find the start and end points? It would be nice if you could offer a link to one.
When I use the Mac version of mothur, no log file is created, so it is hard to follow the MiSeq SOP with our own data. :o