Deblur not carrying forward all reads

Hello,

This might be a bug, or I might not be understanding exactly how to set up the commands… but I am very confused.

I am using

mothur v.1.43.0
Last updated: 11/14/2019

On CentOS Linux 7 (Core)

In the deblur step, there is a mismatch between the number of reads at the beginning and at the end. I tried both:

pre.cluster(fasta=current, name=current, diffs=1, method=deblur)

and

pre.cluster(fasta=current, count=current, diffs=1, method=deblur) after making a count table.
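For reference, I made the count table from the name file with count.seqs, roughly like this (a sketch from memory; count.seqs also accepts a group file to keep per-sample counts):

count.seqs(name=current)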

The fasta points to the last unique fasta file, and the name is the last name file (obviously).

As a result, about 50% of the reads are missing from the count_table created by this step. It uses the alignment from the unique file, but apparently it does not carry forward the counts from the name file or the count_table. The new total read count is larger than the previous number of uniques (so it is reading the original count table), but about half of the total number of reads are missing.

Cheers,

Leo

Hi!

Have you run screen.seqs, remove.seqs, or one of the chimera checking tools between when the name file and the count file were created?

Also, FWIW, I strongly discourage anyone from using deblur… It was trained on data of very poor quality (1x150) and will likely be overly aggressive in pooling sequences if your data are better than what it was trained on.

Pat

Hi Pat

Thanks for the answer.

No, I haven’t. I ran chimera checking and remove.seqs, but gave them the name file, so I have a name file created after the chimera step by remove.seqs. I also checked my count file (before pre.cluster): it has the 20+ million total reads counted, and the number of lines matches the number of unique reads (so I assume it is OK). Running summary.seqs on the align file with either the name or the count file gives me the right number of uniques and total reads.
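The chimera part of the pipeline was roughly this (a sketch; chimera.uchime stands in for whichever checker I used, and the accnos is the chimera output):

chimera.uchime(fasta=current, name=current, reference=self)
remove.seqs(fasta=current, name=current, accnos=current)

so the name file going into pre.cluster is the one written by remove.seqs.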

I am now running a “short” script (just an initial summary.seqs, then pre.cluster, then a second summary.seqs with my align and count files). I will post it here once it is done (tomorrow-ish?). Since I restarted from the alignment step (I thought the problem was the name file, but it was not; I checked lengths, etc.), I had not added an initial summary.seqs before pre.cluster (I have done it now and it reads OK, but I would rather send you a single logfile and not a collage of logfiles!).
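A sketch of the batch (with the long file names shortened):

summary.seqs(fasta=my.align, count=my.count_table)
pre.cluster(fasta=current, count=current, diffs=1, method=deblur)
summary.seqs(fasta=current, count=current)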

About deblur: before this, I was doing something very similar with opticlust, giving it the equivalent of 2 bp as the clustering cutoff (the distances between the resulting OTUs could be as low as 1 bp) and then treating those as 100% similarity OTUs (what some people now call single variants…). I assumed that opticlust does a better job but deblur is faster? A sketch of what I mean is below.
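The opticlust route I mean is roughly this (a sketch; the 0.005 cutoff is my conversion of 2 differences over a ~380 bp amplicon, since 2/380 ≈ 0.005):

dist.seqs(fasta=current, cutoff=0.01)
cluster(column=current, count=current, method=opti, cutoff=0.005)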

I will post the log file with the summary.seqs - deblur - summary.seqs once it is done.

Thanks again,

Leo

It’s been a while since I looked at deblur, but I don’t think the diffs parameter should matter with that algorithm.

Hello Pat

Here is the logfile. I am not sure why, but it drops ~8M reads… I just fed the software the alignment and the count file. It recognizes the ~2.5M uniques and ~17M total reads, but after deblur it has killed ~8M of them… I think the problem happens at get.seqs?

I hope I am not misunderstanding how this works…

Thank you very much for your help,

Leo

(Attachment mothur.1573834577.logfile is missing)

Sorry, I replied via email and the attachment was rejected. Here is the logfile, pasted:

Linux version

Using ReadLine
mothur v.1.43.0
Last updated: 11/15/2019
by
Patrick D. Schloss

Department of Microbiology & Immunology

University of Michigan
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

For questions and analysis support, please visit our forum at https://forum.mothur.org

Type 'quit()' to exit program

[NOTE]: Setting random seed to 19760620.

Batch Mode


mothur > summary.seqs(fasta=protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.align, count=protist.trim.contigs.good.good.unique.pcr.good.good.good.pick.count_table)

Using 40 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	8576	309	0	3	1
2.5%-tile:	1	8576	369	0	5	434158
25%-tile:	1	8576	381	0	5	4341573
Median: 	1	8576	382	0	5	8683145
75%-tile:	1	8576	385	0	6	13024717
97.5%-tile:	1	8576	388	0	6	16932132
Maximum:	1	8576	461	0	10	17366289
Mean:	1	8576	382	0	5
# of unique seqs:	2452562
total # of seqs:	17366289

It took 692 secs to summarize 17366289 sequences.

Output File Names:
protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.summary


mothur > pre.cluster(fasta=current, count=current, diffs=1, method=deblur)
Using protist.trim.contigs.good.good.unique.pcr.good.good.good.pick.count_table as input file for the count parameter.
Using protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.align as input file for the fasta parameter.

Using 40 processors.
When using running without group information mothur can only use 1 processor, continuing.
Total number of sequences before precluster was 2452562.
pre.cluster removed 2304452 sequences.

/******************************************/
Running command: get.seqs(fasta=protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.align, accnos=protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.precluster.align.temp)
Selected 124868 sequences from your fasta file.

Output File Names: 
protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.pick.align

/******************************************/
Done.
It took 76481 secs to cluster 2452562 sequences.

Output File Names: 
protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.precluster.align
protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.precluster.count_table


mothur > summary.seqs(fasta=current, count=current)
Using protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.precluster.count_table as input file for the count parameter.
Using protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.precluster.align as input file for the fasta parameter.

Using 40 processors.

		Start	End	NBases	Ambigs	Polymer	NumSeqs
Minimum:	1	8576	309	0	3	1
2.5%-tile:	1	8576	369	0	5	236068
25%-tile:	1	8576	381	0	5	2360673
Median: 	1	8576	382	0	5	4721346
75%-tile:	1	8576	386	0	6	7082018
97.5%-tile:	1	8576	388	0	6	9206623
Maximum:	1	8576	461	0	10	9442690
Mean:	1	8576	382	0	5
# of unique seqs:	124868
total # of seqs:	9442690

It took 36 secs to summarize 9442690 sequences.

Output File Names:
protist.trim.contigs.good.good.unique.pcr.good.good.good.unique.pick.precluster.summary


mothur > quit()


It took 77890 seconds to run 4 commands from your batch file.

Hey Leo

I’ve never tried deblur. It looks like deblur is ignoring your count file?

Kendra

A couple of things…

  • The deblur algorithm does not conserve the number of reads. It removes reads that it thinks are bad or that are more abundant than it thinks they should be. The algorithm is described in the supplement to the original paper. I suspect that most of your rare sequences are getting tossed, leading to the significant reduction in the number of sequences in your dataset.
  • It appears that you only have one sample in your study. The algorithm works on each sample independently. If you intended to have multiple samples, then you might want to go back and have a look at what’s going on earlier in the pipeline (one way to check is sketched below this list).
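If it helps, count.groups will tell you how many samples mothur sees in your count table, e.g. (using your file name):

count.groups(count=protist.trim.contigs.good.good.unique.pcr.good.good.good.pick.count_table)

It prints the number of sequences in each group; if it reports only one group, the whole dataset is being treated as a single sample.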

I really would discourage the use of deblur.

Thanks for the answers

Ouch, I missed the part about sequences being removed… I will move to something more conservative, then.

Kendra, Pat is right: it is dropping half of my reads.

Thank you very much, and apologies for not properly reading the papers where these algorithms are described.

Best,

Leo
