own mock library sequences fasta file

Hi everyone,
I know that I am asking stupid questions, but I’m confused again with your HMP_MOCK.v35.fasta in the Miseq_SOP. I’m using my own mock library.

The SOP indicates that the HMP sequenced 21 isolates in the mock library to calculate the error rates. I have a couple of questions about this and hope you can help.

  1. Can you please let me know how this fasta file was generated? How was the 16S sequence picked for a single species? Even E. coli has 7 copies of the 16S gene (rrsA-rrsE, rrsG and rrsH), and as far as I know those copies have different sequences, so which one should we choose? Should we grab the sequences from GenBank or SILVA?
  2. In the fasta file, some names carry numbers, like B. vulgatus 1, B. vulgatus 2, B. vulgatus 4, B. vulgatus 5 and B. vulgatus 7. What are those numbers? Different copies within the same strain? If so, why does E. coli only have E. coli 1?
  3. The SOP works with V4, so the sequences should be about 250 bp long? Then why does this file contain V3-V5 sequences? Did you use the V3 and V5 primers to trim the 16S sequences and produce this file?

I’m trying to explain where I was confused and hope it is clear. I was stuck with this step and really hope any of you will be able to help.


I got the answers to my questions. If you are working with your own mock library and have the same questions, please let me know. Thanks,

The fasta sequences came from the organisms’ genome sequences and include all of the rrn copies per genome. I think the version we posted contains only those sequences that were unique within the V3-V5 region. The full-length sequences are available at https://raw.githubusercontent.com/SchlossLab/Kozich_MiSeqSOP_AEM_2013/master/data/references/HMP_MOCK.fasta
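For anyone wondering what "unique within the region" means in practice, here is a minimal Python sketch. It is not the script used to build HMP_MOCK.v35.fasta; the function names, the toy sequences and the coordinates are all made up for illustration, and it assumes the copies already share the same coordinates (for example because they were aligned or trimmed first).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def unique_in_region(records, start, end):
    """Keep the first record for each distinct subsequence seq[start:end]."""
    seen = {}  # region sequence -> header of the first copy that had it
    for header, seq in records:
        region = seq[start:end]
        if region not in seen:
            seen[region] = header
    return [(header, region) for region, header in seen.items()]


# Toy data (hypothetical): three rrn copies from one genome.
example = """>B.vulgatus.1
AAACCCGGGTTT
>B.vulgatus.2
AAACCCGGGTTA
>B.vulgatus.3
AAACCTGGGTTT
"""

kept = unique_in_region(read_fasta(example), 1, 11)
for header, region in kept:
    print(">" + header)
    print(region)
```

In this toy case copies 1 and 2 differ only outside the chosen region, so only copies 1 and 3 are kept. Something like this may be why the numbering in a region-specific reference file has gaps.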

Can you please tell me the answer? I have the same problem too…

Sorry I’m late. The best recommendation is to start with organisms whose genomes are well characterized. From there, grab all of the 16S copies from each organism, trim them to your region of interest, and do the alignment from there. Thanks,
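The trimming step can be sketched roughly like this in Python. This is only an illustration under a few assumptions: it looks for exact primer matches (IUPAC ambiguity codes allowed, no mismatches), the function names are made up, and the primers and sequence in the example are toy values, not real V4 primers. Real 16S copies can have mismatches to the primers, in which case trimming by alignment position is probably safer.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA string, including ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def trim_to_region(seq, forward, reverse, keep_primers=False):
    """Return the part of seq between the two primers.

    Returns None if either primer has no exact match.
    """
    seq = seq.upper()
    fwd = re.search(primer_regex(forward), seq)
    if fwd is None:
        return None
    # The reverse primer anneals to the other strand, so search for its
    # reverse complement downstream of the forward primer.
    downstream = seq[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(reverse)), downstream)
    if rev is None:
        return None
    if keep_primers:
        return seq[fwd.start():fwd.end() + rev.end()]
    return seq[fwd.end():fwd.end() + rev.start()]


# Toy example (hypothetical primers and sequence).
copy = "CCCGTGCCAAAGGTTATCTAAGGG"
insert = trim_to_region(copy, "GTGYCA", "TTAGAW")
print(insert)
```

You would run this over every 16S copy from every genome and then collapse the identical trimmed sequences before using the file as your mock reference.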

Hi there,
I have created my own mock community using DNA from 10 separate bacterial strains purchased from BEI Resources. The website gives the HMP ID and GenBank accession numbers, so I am able to get the whole genome sequence for these particular strains. However, I am unsure how to proceed to make my own fasta file like the ones created for the standard mock communities from BEI (http://gigadb.org/dataset/view/id/100185), to be used in the mothur protocol for assessing the error rate. Essentially, I want to download just the 16S sequences from the whole genome but don’t know the most efficient way to do this… Any suggestions? I have tried searching SILVA etc. to get already curated sequences, but I am worried that these are not specific enough to the strains used and produced by BEI… Any thoughts on this?
Many thanks in advance! :smiley: