Mock community error rates with seq.error


Currently I am processing a mock community that was sequenced in triplicate. It consists of 19 different species from 6 phyla. Some are quite closely related, some are more divergent. They were pooled “evenly” based upon 16S copy number as determined with qPCR and are all fully sequenced strains. To verify purity sanger was performed on the genomic extracts.
Now I’m in a bit of a pickle on how to get some error rates here. The seq.error documentation page is rather poorly documented ( and I would like to know how I should proceed.

What did I try so far:

  1. Make a fasta file with the partial 16S of each (type) strain from either RDP or EMBL repositories
  2. Align this fasta with the SILVA v119 reference alignment
  3. Use pcr.seqs to trim to primer region of my actual mock data
  4. Run filter.seqs
  5. Run mock data through MiSeq SOP until point of error comparison

And here starts the pickle. I am supposed I have to set aligned=T ? Does this mean that both the fasta and the reference have to be aligned? I tried it and I get a whole bunch of errors complaining that they are not the same length. This is quite weird as I used pcr.seqs to trim to the correct primer region, should I include an extra screen. seqs after step 3? Is it okay to use the reference sequences from the databases, or would it be a better idea to use the sanger reads from the strains that we’ve put in (which are, theoretically, the same).

Comments & suggestions are warmly welcome :slight_smile:

Kind regards,


If you’ve filtered your data without including the mock community data and then filtered the mock without the real data, then you will likely have very different alignments. When you filtered your data, it should have generated a filter file ending in *.filter. Use that on your data as we did here:

Alternatively, you could just use aligned=F and it will strip out all of the gaps from both datasets and the error messages should go away.