Discrepancy between number of reads and group counts from make.contigs


I believe I am missing some understanding behind part of the make.contigs command.

I have 50 paired-end reads and anywhere from ~100k-900k reads per sample. For reference, the first contig that gets made has just over 400k reads while the last contig that gets made has just under 200k (output from mothur logfile: It took 153 secs to assemble 406814 reads, It took 19 secs to assemble 191760 reads.).

My area of confusion is that when the group count gets printed in the logfile, the values are quite low for a majority of the samples, but the last 7 samples have counts that seem to be in the correct order of magnitude (e.g.,

Group count: 
sample1 716
sample2 676
sample3 856
sample4 540
sample5 719
sample6 805
sample7 1644
sample8 1409
sample44 85127
sample45 181076
sample46 191916
sample47 133949
sample48 175103
sample49 100820
sample50 151699)

For additional reference, the command I ran is below.

make.contigs(file=PD.files, maxambig=0, oligos=oligos, pdiffs=0, maxlength=300, maxhomop=8, align=needleman, checkorient=true, format=illumina1.8+, match=1, mismatch=-1, gapopen=-2, gapextend=-1, insert=20)

Any info is greatly appreciated.

Hi - I’m a bit confused as well. You say that you have…

  • “50 paired end reads” - do you mean 50 samples?
  • “100k-900k reads per sample” - ok
  • “the first contig that gets made has just over 400k reads” - each contig should only have 2 reads

Can you clarify on those points?


  • What PD.files looks like? Does it have 50 lines or just 1?
  • What does oligos look like?

If I had to guess, once we resolve the confusion over terminology, I suspect things are getting tossed because they don’t match the values in oligos. I also wonder whether some of your samples were sequenced with one region and others with another region. I’d suspect the lower-numbered samples are usually just above 300 nt long.

To test, I’d encourage you to set up PD.file and oligos to run sample1 without the other samples. Then run this…

make.contigs(file=PD.files, oligos=oligos)

If it says it assembled a much larger number of reads into contigs than you get out in summary.seqs, then you’ll want to look in the scrap file to see the codes next to the names to get a better sense of why things are being scrapped.

I’d then go back and add maxambig=0, then pdiffs=0 (seems too conservative to me, but oh well), etc. one at a time to see where you are losing the most sequences. Feel free to post any output from summary.seqs - the first one I showed above would be particularly helpful.


Apologies for the lack of clarity!

Yes - 50 samples and I should have said that the first sample has ~400k reads; I was just trying to illustrate here that the samples have a similar number of reads but the resulting group count is vastly different.

PD.files has 50 lines and the contents of the oligos file are copied below.

I believe you are right in that things are getting tossed because of the inclusion of the oligos file; I ran the same sample through the make.contigs and summary.seqs commands above and when the oligos file is included, of the ~400k processed sequences, the group count is 764, however, when I remove the oligos parameter, the group count matches the number of processed sequences. Clearly, the primers I provided are not present in many of the sequences for this sample (confirmed via the scraps file, which is empty when the oligos file is not included and obviously has a ton of lines when it is included).

I retrieved all of my sequences from NCBI and the associated manuscript states ‘Hypervariable region 4 (V4) of the
bacterial/archaeal 16S rRNA gene was PCR amplified using primers 515F
3′) and sequenced using Illumina MiSeq.’

All this is to say, I am not sure what is wrong with the oligos file, but any insight here would be greatly appreciated (e.g., should I remove the oligos parameter completely or is there an adjustment you can suggest that may help? It is definitely possible I generated the oligos file incorrectly).

I pasted the summary.seqs output from both the oligos and non-oligos make.contigs runs below for additional context.

mothur > summary.seqs(fasta=current)
Using PD_no_oligos.trim.contigs.fasta as input file for the fasta parameter.
Using 8 processors.
                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       251     251     0       3       1
2.5%-tile:      1       292     292     0       4       10171
25%-tile:       1       293     293     0       4       101704
Median:         1       293     293     0       4       203408
75%-tile:       1       293     293     0       4       305111
97.5%-tile:     1       293     293     8       5       396644
Maximum:        1       502     502     60      249     406814
Mean:   1       292     292     0       4
# of Seqs:      406814
It took 1 secs to summarize 406814 sequences.
mothur > summary.seqs(fasta=current)
Using PD.trim.contigs.fasta as input file for the fasta parameter.
Using 8 processors.
                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       251     251     0       3       1
2.5%-tile:      1       252     252     0       3       20
25%-tile:       1       253     253     0       4       192
Median:         1       253     253     0       4       383
75%-tile:       1       253     253     0       4       574
97.5%-tile:     1       253     253     1       5       745
Maximum:        1       253     253     6       6       764
Mean:   1       252     252     0       4
# of Seqs:      764
It took 0 secs to summarize 764 sequences.

Thank you for the help!

Hey Jesse - Thanks for the update! A couple of ideas…

  1. Could you run make.contigs with the oligos file again. Then fish out of the scrap file a fasta sequence that looks good other than the primers? There should be a code next to the name like f or r corresponding to the forward or reverse primer that it can’t find. I think something like this should help get you a count of all the ways files got scrapped
grep ">" data.scrap.contigs.fasta | cut -f 2 -d "|" | sort | uniq -c
  1. What happens when you use pdiffs=2 as an argument in make.contigs with the oligos file?

  2. Returning to the fasta sequence you fished out of the scrap file, can you see what looks like it is probably the correct primer sequences? That record looks like they pooled data from two studies in the upload. I wonder if there may have been a subtle change in the primer sequences that they used between the two studies that they overlooked.


Hi Pat,

Thanks for the quick response.

  1. I am not 100% sure if this is what you are looking for here, but I’ve copied a single sequence from the scrap.fasta and one from the contigs.fasta for comparison; clearly some of the sequences are missing the primers completely but I don’t have a good reason for why.
>SRR10911767.254552     ee=0.00866764   fbdiffs=0(match), rbdiffs=0(match) fpdiffs=1(match), rpdiffs=0(match) 
>SRR10911767.51226 | f(f)(bf)   ee=0.0182898    fbdiffs=1000(noMatch), rbdiffs=1000(noMatch) fpdiffs=9(noMatch), rpdiffs=1002(noMatch)
  1. Using pdiffs=2 seems to help significantly (406814 sequences gets reduced to 403038 for example)

  2. Again, I should have been a little clearer here, my apologies. You are correct in that they pooled data from two studies, however, I only used what they deem dataset_2, which accounts for 507 samples that the authors sequenced themselves and used those primers that I pasted above.

I think in general, it makes sense to allow for some primer differences and it seems like that allows us to retain the majority of the sequences (at least when applied to one sample; I will likely test this with a larger sample size to see how that goes).

Let me know what you think!

Hey Jesse- is it possible some of the sequences getting tossed are reverse complements? You might take one of the sequences that bonked and try blasting it against NCBI to see what it says it is


Hi Pat,

I just blasted a couple of the scrap sequences and they both came back as a partial matches for various bacterial species.

I think the fix in general may just be to allow for some number of primer differences and whatever is lost beyond that is lost, as hypothetically speaking, the primers I am specifying in the oligos file should be present in all sequences.

Happy to hear your opinion.