seq.error

Dear all mothur lovers,
I have a blunt question.
In the Schloss SOP, I see the command seq.error.
As I understand it, we have to use a reference fasta file to compare our sequences against for errors. Of course, we have to filter the mock community file using the same filter parameters as in the analysis of our own fasta files.
I do not understand what the following means, where the output group file will be used, or what we have to do with it …
Looking at the input to this command, the fasta, group, and names files come from the ongoing analysis (the one I or any other person is performing), while groups= comes from the reference (the mock community). I cannot understand what we are trying to get with the get.groups command here, or why the MOCK community should be present in my fasta and group files.

Next, we need to get the sequences that correspond to our mock community (the group name is MOCK.GQY1XT001):
mothur > get.groups(fasta=final.fasta, group=final.groups, groups=MOCK.GQY1XT001, name=final.names)

Secondly, I am unable to understand the calculation as stated: "That’s right, we’ve reduced the sequencing error rate from ~0.8% to 0.007%."
It would be great if someone could shed some light on this.

I’m not sure I understand your questions, but I’ll try to respond…

Looking at the input to this command, the fasta, group, and names files come from the ongoing analysis (the one I or any other person is performing), while groups= comes from the reference (the mock community). I cannot understand what we are trying to get with the get.groups command here, or why the MOCK community should be present in my fasta and group files.

I’m sure there’s a question here, but I’m not sure what it is. You are using get.groups to pull out the sequences and names associated with the sequences from the Mock community. You need the group file to tell the command which sequences had the mock community barcode.

Secondly, I am unable to understand the calculation as stated: "That’s right, we’ve reduced the sequencing error rate from ~0.8% to 0.007%."

The “~0.8%” comes from the PLoS ONE paper we wrote last year. The final 0.007% error is the error associated with the non-chimeric sequences. It’s the total number of sequenced bases that don’t match the reference divided by the total number of bases sequenced.

Hope this helps,
Pat
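
The calculation Pat describes (bases that do not match the reference, divided by the total bases sequenced) can be sketched in Python. This is only an illustration and not mothur's seq.error implementation: the function name `error_rate` is hypothetical, the reads are assumed to be already aligned to their references at equal length, and counting a gap opposite a base as one error is an assumption made here.

```python
GAPS = {"-", "."}

def error_rate(pairs):
    """Return mismatched bases / total query bases.

    pairs: iterable of (query, reference) strings that are already
    aligned to each other and have equal length (an assumption).
    """
    mismatches = 0
    total_bases = 0
    for query, reference in pairs:
        if len(query) != len(reference):
            raise ValueError("aligned sequences must have equal length")
        for q, r in zip(query.upper(), reference.upper()):
            if q in GAPS and r in GAPS:
                continue  # column is empty in both sequences
            if q not in GAPS:
                total_bases += 1  # a base that was actually sequenced
            if q != r:
                mismatches += 1  # substitution, insertion, or deletion
    return mismatches / total_bases if total_bases else 0.0

# Two made-up aligned read/reference pairs, for illustration only.
example = [("ACGT", "ACGA"), ("AC-T", "ACGT")]
print(f"{100 * error_rate(example):.3f}%")
```

On real data the same ratio would be taken over every mock community read, which is why the mock reads have to be pulled out with get.groups first.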

Hello there. I am similarly puzzled by the following, which I copied from the logfile (HD is my own file name):

get.groups(fasta=final.fasta, group=final.groups, groups=MOCK.HD, name=final.names)
MOCK.HD is not a valid group, and will be disregarded.
You provided no valid groups. I will run the command using all the groups in your groupfile.
Selected 172852 sequences from your name file.
Selected 13775 sequences from your fasta file.
Selected 172852 sequences from your group file.

It can’t find MOCK.HD, so where do I find it? Or should I create it from some small portion of my final.fasta, and if so, how?

thx

Dear Schloss,
Thanks for the reply, although I still do not understand why mock community sequences or barcodes should be present in my fasta files.
Please have a look: the fasta file is the user's file, and we are giving groups=mock and group=the user's group file.
So mock is not mentioned as a group in my group file, nor does my fasta file contain sequences from a mock community.
According to the command, mock sequences should be selected from my fasta file, so I guess the command will not work this way.
That is my question: why should these sequences be in my fasta file (the user fasta file)?
Another user has the same problem.
Secondly, at which point am I using these sequences? If it is in the seq.error command, then it would be best to have these sequences as a fasta file, like the silva or rdp sequence fasta files; we could filter it (with hard=filter) and use it in the seq.error command.

So mock is not mentioned as a group in my group file, nor does my fasta file contain sequences from a mock community.
According to the command, mock sequences should be selected from my fasta file, so I guess the command will not work this way.
That is my question: why should these sequences be in my fasta file (the user fasta file)?
Another user has the same problem.
Secondly, at which point am I using these sequences? If it is in the seq.error command, then it would be best to have these sequences as a fasta file, like the silva or rdp sequence fasta files; we could filter it (with hard=filter) and use it in the seq.error command.

I’m sorry, but I really don’t know what you’re trying to ask.

This may seem a silly question to ateeqrr and lusilan, but are you working with the sample data given in the SOP or your own data? The way you’ve worded your questions makes it sound like you’re using your own data.

MOCK.GQY1XT001 is a sample included in the SOP practice data (which you can easily see by either looking at the oligos file or running count.groups() on the data). If you’re working with your own 454 data, and from the way you’ve worded your comments it sounds like you are, there will not be a mock community unless you included one.
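
To check whether a group such as MOCK.HD exists before calling get.groups, the groups in a group file can be tallied. The Python sketch below mimics that kind of check outside mothur. It assumes the group file has two whitespace-separated columns (sequence name, then group name); the function name `count_groups` and the sequence names in the example are made up for illustration.

```python
from collections import Counter

def count_groups(group_lines):
    """Tally sequences per group from the lines of a group file.

    Assumes each non-empty line is: <sequence name> <group name>.
    """
    counts = Counter()
    for line in group_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts

# Hypothetical group-file contents; real files come from your own run.
lines = [
    "seq0001\tF003D000",
    "seq0002\tF003D000",
    "seq0003\tMOCK.GQY1XT001",
]
counts = count_groups(lines)
for wanted in ("MOCK.GQY1XT001", "MOCK.HD"):
    status = "present" if wanted in counts else "not in this group file"
    print(wanted, status)
```

If the mock group name is absent from the tally, get.groups has nothing to select, which matches the "not a valid group" message quoted earlier in the thread.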

Thank you for your kind reply; it is precious for newcomers who often ask silly questions. Yes, I was using my own data. It was just abrupt for me to see MOCK.GQY1XT001 here (in my data it would be MOCK.HD), because this file does not seem to be introduced or referred to in the earlier part of the SOP.

How can I include one, if I may ask? Could you please explain this part in some detail? I would be much indebted.

You would have to design and make your own mock community, which you would then PCR and pool with the rest of your samples. If you already have your sequence data, it’s too late :). One suggestion for the future would be to make a mock community by pooling clones from old 16S rRNA clone libraries that you’ve already got good sequence data for.

Dear Patrick,

I find seq.error a very important analysis tool for mock studies, but I am afraid not many people are aware of its existence, and its documentation is rather incomplete compared to other parts of mothur. I first want to thank you for developing a tool that reveals many otherwise unknown aspects of the data.

I have a few questions about this subroutine.
Is it correct that, for the datasets presented in the Schloss et al. 2011 PLoS ONE paper, you used seq.error to identify the “undetected chimeras” (those missed by Uchime etc.)? Both seq.error outputs, *.error.summary and *.error.chimera, have annotations for reads indicating the number of parents. I was assuming that you used this information to filter out the false negatives of the chimera detection algorithms; is that correct?

Can you also briefly explain the difference between the seq.error outputs *.error.seq.forward and *.error.seq.reverse? I also could not understand why the output file *.error.ref-query contains only a subset of the read-reference alignments rather than all query (read)-reference pairs.

Best regards

Bump…

I think I know what he was asking. I’ll try to rephrase:

This is directly from 454 SOP:

Next, we need to get the sequences that correspond to our mock community (the group name is MOCK.GQY1XT001):

mothur > get.groups(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta, group=GQY1XT001.shhh.good.pick.pick.groups, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.names, groups=MOCK.GQY1XT001)

He is asking why we would need to search the file fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.pick.fasta (the current analysis data) for the MOCK.GQY1XT001 barcode, since he expects there to be no “mock” barcodes in the files listed above.

He is also concerned that if he runs an analysis with his fasta=AAA.fasta, group=AAA.group, name=AAA.names, it also won’t contain any MOCK.AAA barcodes.

This is something I found in SchlossSOP:
There is a GQY1XT001.MOCK.GQY1XT001.shhh.groups file in SchlossSOP folder, for example. This file does contain “GQY1XT001.MOCK.GQY1XT001” groups in it.

The GQY1XT001.oligos file also contains the barcode:

barcode AACCGTGTC MOCK.GQY1XT001

Is there some information (steps) missing in the 454 SOP regarding this barcode and the subsequent analysis? Indeed, this barcode will put sequences starting with AACCGTGTC into the MOCK.GQY1XT001 group, so that we can get those sequences and groups later.

Any comments?
Thanks!
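
The way a barcode line maps reads to a group can be sketched in Python. This is a simplified illustration and not how mothur itself processes an oligos file: it only does exact prefix matching and ignores primers and allowed mismatches. The function names `parse_barcodes` and `assign_group`, the second barcode line, and the read are all made up for the example; only the first barcode line comes from the thread.

```python
def parse_barcodes(oligos_lines):
    """Map barcode sequence -> group name from 'barcode SEQ GROUP' lines."""
    barcodes = {}
    for line in oligos_lines:
        fields = line.split()
        if len(fields) == 3 and fields[0].lower() == "barcode":
            barcodes[fields[1].upper()] = fields[2]
    return barcodes

def assign_group(read, barcodes):
    """Return the group whose barcode the read starts with, else None."""
    read = read.upper()
    for barcode, group in barcodes.items():
        if read.startswith(barcode):
            return group
    return None

oligos = [
    "barcode AACCGTGTC MOCK.GQY1XT001",  # line quoted from GQY1XT001.oligos
    "barcode TTGGCACAG SAMPLE.EXAMPLE",  # hypothetical second sample
]
barcodes = parse_barcodes(oligos)
print(assign_group("AACCGTGTCACGTACGTAGG", barcodes))
```

Reads that carry the mock barcode end up in the mock group of the group file, which is what get.groups later selects by name.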

In the 454 SOP, there are sequencing samples from mouse fecal material (test samples) and a mock community composed of genomic DNA from 21 bacterial and archaeal strains, all sequenced together in the same sff file.

All mock community sequences were sequenced with the barcode listed in GQY1XT001.oligos, which places them under a separate “group”.

barcode AACCGTGTC MOCK.GQY1XT001

Obviously, in a custom 454 run, if you did not sequence the same mock community listed in the PLoS ONE 2011 paper, there is no way for you to estimate the Specificity or Sensitivity (Error Rate section of the 454 SOP) of your analyses.