Discrepancy between fasta and group counts

Hi,
I am observing a strange discrepancy between the number of sequences in my fasta file and the number in the group file. It seems to be present right from the start of my analysis “pipeline”, already in the output of the make.contigs command:

make.contigs(file=bf11_files.txt, processors=4)
[…]
Total of all groups is 13447322
Output File Names:
bf11_files.trim.contigs.fasta → wc -l = 26894588, i.e. 13447294 sequences
bf11_files.trim.contigs.qual
bf11_files.contigs.report
bf11_files.scrap.contigs.fasta → 0 lines
bf11_files.scrap.contigs.qual → 0 lines
bf11_files.contigs.groups → count.groups() outputs 13447322 sequences for the 48 groups (28 sequences more than in the fasta)

I have gone further down the analysis pipeline and this difference of 28 sequences seems to be kept at the following steps (pcr.seqs, screen.seqs…).

How could this discrepancy be explained? Is it a mistake to count lines by doing “wc -l” on the fasta file (and divide the result by 2) to get the number of sequences?

Maxime

Could you post the command you ran and what version of mothur you are using?

Hi Sarah,
I am currently using mothur 1.39 on Windows (because I have not been able yet to update manually the version of mothur that is included in the Bio-Linux distribution).

This is the logfile of the session during which I realized this discrepancy. I removed the lines corresponding to the group counts to make it shorter.

Windows version

Running 64Bit Version

mothur v.1.39.0
Last updated: 1/23/2017

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type ‘help()’ for information on the commands that are available

For questions and analysis support, please visit our forum at > https://www.mothur.org/forum

Type ‘quit()’ to exit program
Interactive Mode


mothur > get.current()

Current RAM usage: 4.68532 Gigabytes. Total Ram: 31.9458 Gigabytes.

Current default directory saved by mothur: E:\Sequencing_data\Bourg-Fidele_data\2017-01-26_Eleventh_try\

Current working directory: E:\Sequencing_data\Bourg-Fidele_data\2017-01-26_Eleventh_try\

mothur >
make.contigs(file=bf11_files.txt, processors=4)
[…]
It took 4505 secs to process 13447322 sequences.

Group count:
[…]
Total of all groups is 13447322

Output File Names:
bf11_files.trim.contigs.fasta
bf11_files.trim.contigs.qual
bf11_files.contigs.report
bf11_files.scrap.contigs.fasta
bf11_files.scrap.contigs.qual
bf11_files.contigs.groups

[WARNING]: your sequence names contained ‘:’. I changed them to ‘_’ to avoid problems in your downstream analysis.

mothur >
count.groups(group=current)
Using bf11_files.contigs.groups as input file for the group parameter.
[…]
Total seqs: 13447322.

Output File Names:
bf11_files.contigs.count.summary


mothur > pcr.seqs(fasta=current, oligos=primers_16S.oligos, group=current, pdiffs=0, nomatch=reject, keepprimer=true) Using bf11_files.trim.contigs.fasta as input file for the fasta parameter. Using bf11_files.contigs.groups as input file for the group parameter.

Using 4 processors.
Removed 9723566 sequences from your group file.

Output File Names:
bf11_files.trim.contigs.pcr.fasta
bf11_files.trim.contigs.bad.accnos
bf11_files.trim.contigs.scrap.pcr.fasta
bf11_files.contigs.pcr.groups


It took 956 secs to screen 13447294 sequences.

mothur >
count.groups(group=current)
Using bf11_files.contigs.pcr.groups as input file for the group parameter.
[…]
Total seqs: 3723756.

Output File Names:
bf11_files.contigs.pcr.count.summary


mothur > remove.groups(fasta=current, group=current, groups=S09-S48) Using bf11_files.trim.contigs.pcr.fasta as input file for the fasta parameter. Using bf11_files.contigs.pcr.groups as input file for the group parameter. Removed 377 sequences from your fasta file. Removed 377 sequences from your group file.

Output File names:
bf11_files.trim.contigs.pcr.pick.fasta
bf11_files.contigs.pcr.pick.groups


mothur > count.groups(group=current) Using bf11_files.contigs.pcr.pick.groups as input file for the group parameter. [...] Total seqs: 3723379.

Output File Names:
bf11_files.contigs.pcr.pick.count.summary


mothur > screen.seqs(fasta=current, group=current, maxambig=0, maxhomop=10, minlength=400) Using bf11_files.trim.contigs.pcr.pick.fasta as input file for the fasta parameter. Using bf11_files.contigs.pcr.pick.groups as input file for the group parameter.

Using 4 processors.

Output File Names:
bf11_files.trim.contigs.pcr.pick.good.fasta
bf11_files.trim.contigs.pcr.pick.bad.accnos
bf11_files.contigs.pcr.pick.good.groups


It took 89 secs to screen 3723351 sequences.

mothur >
count.groups(group=current)
Using bf11_files.contigs.pcr.pick.good.groups as input file for the group parameter.
[…]
Total seqs: 2282158.

Output File Names:
bf11_files.contigs.pcr.pick.good.count.summary


mothur > summary.seqs(fasta=current) Using bf11_files.trim.contigs.pcr.pick.good.fasta as input file for the fasta parameter.

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 400 400 0 3 1
2.5%-tile: 1 440 440 0 4 57054
25%-tile: 1 440 440 0 5 570533
Median: 1 442 442 0 5 1141066
75%-tile: 1 464 464 0 6 1711598
97.5%-tile: 1 466 466 0 7 2225077
Maximum: 1 500 500 0 10 2282130
Mean: 1 449.212 449.212 0 5.39261

of Seqs: 2282130

Output File Names:
bf11_files.trim.contigs.pcr.pick.good.summary

It took 42 secs to summarize 2282130 sequences.

The discrepancy of 28 sequences seems to be there right after the make.contigs (because the “wc -l / 2” on the fasta is 13447294 whereas the count.groups() is 13447322) and to maintain afterwards (after pcr.seqs, screen.seqs…).
Yours,
Maxime

Hi,
I have manually checked the numbers of reads in individual FastQ files (48 pairs of one forward file and one reverse file), by doing wc -l (and divide by 4) on each file and the numbers correspond perfectly to the counts in the group file (total number 13447322). But when I do summary.seqs on the fasta file (created by make.contigs), it has only 13447294 sequences.
So, it seems that the fasta file somehow misses 28 sequences. But why???
Maxime

How many sequences are in bf11_files.trim.contigs.pcr.fasta?

mothur > summary.seqs(fasta=bf11_files.trim.contigs.pcr.fasta)

How many sequences in bf11_files.trim.contigs.bad.accnos? The group file after pcr.seqs indicates this should be 9723566. Is it?
bf11_files.contigs.pcr.groups = Total seqs: 3723756

Unfortunately I don’t have access to my “bioinformatics workstation” and I am not able to try everything that you suggest. Is it correct to use wc -l to count sequences in the files (dividing by 2 for a fasta, by 4 for a fastq…)?

However, I have some of the information:
bf11_files.trim.contigs.pcr.fasta -> wc -l = 7447456, i.e. 3723728 sequences, whereas count.groups on the corresponding group-file returns 3723756 sequences (again 28 more than the fasta).

Concerning the accnos, I can’t tell you for now (very likely not before Monday). Would it be of any use that I send you the raw data (or rather, put them on a server)? After all, they are not that huge…

Yours,
Maxime

Dear Sarah,
Concerning the whole set of sequences (16S + ITS) produced by make.contigs(), here is the output (mothur 1.36.1 on Bio-Linux):

mothur > summary.seqs(fasta=bf11_files.trim.contigs.fasta, processors=4)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 35 35 0 2 1
2.5%-tile: 1 44 44 0 4 336183
25%-tile: 1 255 255 0 5 3361824
Median: 1 272 272 0 5 6723648
75%-tile: 1 440 440 1 5 10085471
97.5%-tile: 1 465 465 14 8 13111112
Maximum: 1 502 502 93 250 13447294
Mean: 1 311.318 311.318 1.30977 5.35333

of Seqs: 13447294

Output File Names:
bf11_files.trim.contigs.summary

It took 237 secs to summarize 13447294 sequences.

It’s in good agreement with the number of sequences in the fasta file:

wc -l bf11_files.trim.contigs.fasta
26894588 bf11_files.trim.contigs.fasta ->13447294 sequences

But it’s not in agreement with the group-file, for which wc -l or count.groups() both yield 13447322 sequences, nor with the sum of the numbers of sequences in the fastq files (also 13447322). Both bf11_files.scrap.contigs.fasta and bf11_files.scrap.contigs.qual produced by make.contigs() are empty files (0 bytes). For the report and qual files:

wc -l bf11_files.contigs.report
13447323 bf11_files.contigs.report → the extra line is just the header

wc -l bf11_files.trim.contigs.qual
26894586 bf11_files.trim.contigs.qual → 13447293 sequences (so 1 less!)

Then I do pcr.seqs() to separate the 16S and ITS sequences (keep 16S and reject ITS), and the number of sequences in bf11_files.trim.contigs.bad.accnos is in agreement with the output of pcr.seqs():

Removed 9723566 sequences from your group file.

wc -l bf11_files.trim.contigs.bad.accnos
9723566 bf11_files.trim.contigs.bad.accnos

wc -l bf11_files.trim.contigs.scrap.pcr.fasta
19447132 bf11_files.trim.contigs.scrap.pcr.fasta → 9723566 sequences removed

But again there is a discrepancy betwwen the fasta and the group file:

count.groups(group=current)
Using bf11_files.contigs.pcr.groups as input file for the group parameter.
[…]
Total seqs: 3723756.

wc -l bf11_files.contigs.pcr.groups
3723756 bf11_files.contigs.pcr.groups

mothur > summary.seqs(fasta=bf11_files.trim.contigs.pcr.fasta)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 37 37 0 3 1
2.5%-tile: 1 439 439 0 4 93094
25%-tile: 1 440 440 0 5 930933
Median: 1 441 441 0 5 1861865
75%-tile: 1 460 460 1 6 2792797
97.5%-tile: 1 466 466 6 8 3630635
Maximum: 1 500 500 46 231 3723728
Mean: 1 445.693 445.693 0.851826 5.41634

of Seqs: 3723728

Output File Names:
bf11_files.trim.contigs.pcr.summary

It took 73 secs to summarize 3723728 sequences.

Yours,
Maxime

Could you have sequences with duplicate names? You can check this with the list.seqs and get.seqs commands.

mothur > list.seqs(fasta=bf11_files.trim.contigs.fasta) - list names of sequences in file
mothur > get.seqs(fasta=current) - selects all the sequences in the accnos file. If there are duplicate names read mothur will output a warning and not add the duplicate to the *.pick.fasta file.

I tried your suggestion and it went without a warning:

mothur > list.seqs(fasta=bf11_files.trim.contigs.fasta)

Output File Names:
bf11_files.trim.contigs.accnos


mothur > get.seqs(fasta=current) Using bf11_files.trim.contigs.fasta as input file for the fasta parameter. Using bf11_files.trim.contigs.accnos as input file for the accnos parameter. Selected 13447294 sequences from your fasta file.

Output File Names:
bf11_files.trim.contigs.pick.fasta

I checked the numbers of sequences in the fastq files by counting the occurrences of the word “AA2UK” (because what all read numbers start with) and again I got results equal to the numbers in the group file. It really seems that something is missing from the fasta file, but what?
Maxime

Thanks for the extra information. Did you receive any error messages while running pcr.seqs? Did you use multiple processors?

Yes, I did use processors=4 because I have a quad-core CPU. No, I did not get any error message while running pcr.seqs.
Shall I send you the full mothur logfile, at the mothur-bug e-mail adress?
Maxime

Yes, please send your files so I can help.

Sorry for the delay, I have just sent you the logfile.
Maxime

Hi,

I’m having kind of the same issue. In my case, I see a discrepancy between the number of sequences between .trim.contigs.fasta and .contigs.groups. In other words, right after having run make.contigs.

I’ve just executed it as in the Miseq SOP tutorial:
make.contigs(file=/<path>/analysis_mothur/samples.files, inputdir=/<path>/analysis_mothur/reads, processors=16)

.trim.contigs.fasta and .contigs.groups contain 987903 and 915039, respectively. When checking for exactly the same sequences with grep ">" -v .trim.contigs.fasta | sort -u | wc -l, I get 221846.

It seems then the issue comes from the very beginning.

I’m running mothur v.1.40.4 in Linux (CentOS 7).

Thanks

Hi,

Can you try again with the latest version of mothur and report back to a new thread (this one is 3 yrs old)?

Pat

Hi @pschloss,
I installed latest version and still have this issue.

Thanks

We sometimes see this when you have a duplicate file pair in the files file. The duplicate file pair will cause duplicate reads in the fasta file. The *.fasta file will contain duplicates, but the group file won’t because of how they are created. Can you try running the following to see if this is the case:

mothur > list.seqs(fasta=yourFastaFile) - list all seqs in fasta file
mothur > get.seqs(fasta=current) - select the seqs, ignoring duplicates

If this is not the cause of the issue, can you send your logfile and a link to your fastq files to mothur.bugs@gmail.com?