File error right from the start

jhkeet · May 21, 2020, 7:44am

mothur v.1.44.1 running of Windows 10 64 bit

Hi all,

I am getting errors right from the very start of my pipeline. The Illumina files I received from the service provider are the typical run-of-the-mill fasta and qual files, and mapping files.

I open mothur, and run the following commands:

set.current(processors=1, fasta=040920SWnodC_full.fasta, qfile=040920SWnodC_full.qual, oligos=040920SWnodC.oligos)

trim.seqs(fasta=current, qfile=current, oligos=current, qaverage=25, pdiffs=2, bdiffs=2)

And then just errors all the way… specifically stuff like this (many of them):

“[ERROR]: In sequence M02542_32_000000000-J47WM_1_1101_15286_1199’s quality scores, expected a number and got >M02542:32:000000000-J47WM:1:1101:18259:1200, setting score to 0.”

and also (again, many of them)

“sequence name mismatch btwn fasta: M02542_32_000000000-J47WM_1_1101_18259_1200 and qual file: M02542_32_000000000-J47WM_1_1101_17414_1199”

On a previous version, mothur just kept on generating errors until I manually killed it. In the latest version it stops, and give the message:

"[ERROR]: std::bad_allocRAM used: 4.7353Gigabytes . Total Ram: 5.87916Gigabytes.

has occurred in the QualityScores class function trimQScores. This error indicates your computer is running out of memory. This is most commonly caused by trying to process a dataset too large, using multiple processors, or a file format issue. If you are using multiple processors, try running the command with processors=1, the more processors you use the more memory is required. Also, you may be able to reduce the size of your dataset by using the commands outlined in the Schloss SOP, http://www.mothur.org/wiki/Schloss_SOP. If you are unable to resolve the issue, please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry."

So: is this an issue with the raw files themselves, or some other issue? The files sizes are 1.5 GB (fasta) and 4 GB (qual). I am using a very new laptop. I have 8GB RAM. I guess it could be a performance issue, but I don’t get it since I have run similar (and larger) datasets on a laptop that was 8 years old! And with only 4GB RAM (which eventually crashed last year)

Thanks in advance.

pschloss · May 21, 2020, 8:58pm

Hi

I suspect something got out of whack with your fasta and qual files. Normally, you’ll get fastq files from an Illumina machine, which embeds both the sequence and quality score data. Can you get those instead of separate fasta and qual files? Also, I’m not sure what your downstream goals are, but we discourage quality trimming sequences before forming contigs. It actually makes the assembly worse than leaving them at their full length.

Thanks,
Pat

jhkeet · May 22, 2020, 9:50am

Okay, thanks for the help. We are going to contact them regarding the fastq files. Interestingly, this particular sequence provider has not given us fastq files in the previous two or three projects. So not sure what’s up with that…

jhkeet · May 22, 2020, 12:05pm

Just a note: the gene region that was sequence is the short nodC gene region (nodulation gene for nitrogen fixers), so I think only the forward sequences were obtained. This would give no use for make.contigs, right? I see the fastq.info can generate fasta and qual files, maybe these will then match up correctly.

pschloss · May 25, 2020, 5:22pm

Right - if you only have reads in one direction then there’s no use for make.contigs and you’re right to use trim.seqs.

Pat

jhkeet · May 26, 2020, 8:35am

Just a quick question: I made separate fasta and qual files from the fastq file (with fastq.info), but upon using trim.seqs thereafter, literally all the sequences get chucked out as scrap (even with setting pdiff=2 and bdiff=2). Could this be indicative of a botched Illumina output?

JH

fastq.info(fastq=040920SWnodC_pr.fastq)
trim.seqs(fasta=040920SWnodC_pr.fasta, qfile=040920SWnodC_pr.qual, oligos=040920SWnodC.oligos, qaverage=25, pdiffs=2, bdiffs=2)

pschloss · May 26, 2020, 11:04am

you might leave out the qaverage=25 argument and see how many more sequences get through.

jhkeet · May 26, 2020, 3:13pm

Okay I tried a couple of combinations now (increasing bdiffs and pdiffs) as well as decreasing qaverage like you mention (even increasing them in the fastq.info command), until at least 1 sequence is not scrapped

So I am adding just add a few scrap sequences so you can see. I am not sure what the error codes “k” and “kr” mean (they seem to dominate), they are not on the wiki as far as I can tell. Maybe you can say what they mean (there is a kb and kbr too). Also adding the code I used.

fastq.info(fastq=040920SWnodC_pr.fastq, pdiffs=3, bdiffs=3)
trim.seqs(fasta=040920SWnodC_pr.fasta, qfile=040920SWnodC_pr.qual, oligos=040920SWnodC.oligos, qaverage=0, pdiffs=3, bdiffs=3)

>a12__M02542_32_000000000-J47WM_1_2105_11617_11802 | k	bdiffs=3(match) fpdiffs=0(match) rpdiffs=0(match) 	
TCTGCGTCAACAATTGCGGTGGGCACGGAGTACCTTCCGGGATACAGCGCTTGCGCTACCCCTGCTGCCAAGCCTCGACTTCTACATCACGCTGGACATTGTCGGGCAGAACCTCCTGCCGCTTCTGCTCGGCGTTTCAATACTGACCGCGCTTGCGCAAATCGCACTGACGAGCGAGTTGCCGTGGCCGACGGTACTGATCATCGCGGCAATGACCATGGTTCGCTGCAGCTTGGCAGCATTTCGTGCCCGACAGCTTCGCTTTCTCGCTTTCGCTCTACACAAACCGATCAGCATGTTCCTGTTACTTCCTATGAAGGTCTACGCCTTGTGCAC
>a26__M02542_32_000000000-J47WM_1_1111_22862_14546 | kr	bdiffs=3(match) fpdiffs=0(match) rpdiffs=11(noMatch) 	
TCTGCGCCAACAACTCCGCTGGGCTCGAAGCACTTTCCGGGACACCCTGCTTGCGCTGCGTCTCCTGCCCAGTCTCGATCCTTATCTCACGTTGGACGTGGTCGGACAGAACCTCGGGCCCCTACTTCTCGCCCTCTCCGTAACCACAGCGATCGCACACCTCGTGTTGACTGGCGAAATGCCCTGGTGGACGGTCATCATGATTTTATCGATGACTATAATTCACTGCAGCATGTCAGCGTTGCGAGCTCGTCAACTGCGATTTCTAGGATTTTGTCTGCACACACCTATCAACCTCTTTCTTTTGCTTCCC


>a12__M02542_32_000000000-J47WM_1_2105_11617_11802	0.0269485
38 38 38 38 38 38 38 38 38 38 38 38 38 38 34 37 38 38 38 38 38 38 38 38 38 38 38 42 42 42 42 42 42 30 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 
>a26__M02542_32_000000000-J47WM_1_1111_22862_14546	0.025731
38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38

pschloss · May 26, 2020, 4:13pm

Can you post what your oligos file looks like?

jhkeet · May 26, 2020, 5:47pm

Here it is:

linker	CCGGATAGGMTGGKBCCRTA
reverse	GTGCACAASGCRTADRCCTTCAH
barcode	GCTAGCCC	s1
barcode	GCTAGGTC	s2
barcode	GCTAGTAG	s3
barcode	CGAGGGCA	s4
barcode	CGAGGTGG	s5
barcode	CGAGTAAA	a1
barcode	CGAGTAGA	a2
barcode	CGAGTCAT	a3
barcode	CGAGTCCA	a4
barcode	CGAGTGCT	a5
barcode	CGAGTTAT	a6
barcode	CGAGTTGA	a7
barcode	CGATACAG	a8
barcode	CGATAGAG	a9
barcode	CGATAGCG	a10
barcode	CGATAGGA	a11
barcode	CGATAGGT	a12
barcode	CGATCATT	a13
barcode	CGATGAGA	a14
barcode	CGATGGAT	a15
barcode	CGATGGCT	a16
barcode	CGATGTGA	a17
barcode	CGATTAAT	a18
barcode	CGATTACA	a19
barcode	CGATTGAA	a20
barcode	CGATTGTA	a21
barcode	CGATTGTG	a22
barcode	CGCAAATT	a23
barcode	CGCAACTA	a24
barcode	CGCACTCA	a25
barcode	CGCAGAAT	a26
barcode	CGCAGATG	a27
barcode	CGCAGCAT	a28

jhkeet · May 26, 2020, 6:49pm

So just a note: i added removed the linker part (added it in when I struggled with original fasta and qual mismatches), and ran it again with the normal parameters and the number of discarded sequences still seem extremely low…

mothur > 
trim.seqs(fasta=040920SWnodC_pr.fasta, qfile=040920SWnodC_pr.qual, oligos=040920SWnodC.oligos, qaverage=25, pdiffs=2, bdiffs=2)

Using 1 processors.
It took 951 secs to trim 1144634 sequences.

Group count: 
a1	49
a10	80
a11	71
a12	4
a13	2
a14	123
a15	20
a16	28
a17	123
a19	1
a2	57
a20	117
a21	14
a22	858
a23	679
a24	15
a25	4
a26	12
a27	57
a28	51
a4	82
a5	52
a6	10
a7	3
a8	18
a9	138
s1	1547
s2	268
s3	35
s4	162
s5	85
Total of all groups is 4765

pschloss · May 28, 2020, 12:47pm

Can you tell us how the sequencing was done? What primers were used to do the sequencing? Was there a separate index sequencing read?

A couple of things I can tell you…

You do not want to use reverse. In your oligos file. mothur will expect the sequence to end with the reverse complement of the reverse oligo. The first sequence you posted has the reverse primer in it. I suspect that’s because the sequencing read through the end of the read.
You likely don’t want the linker (that’s the k error code). Given the degeneracies in your linker sequence, I suspect that is really meant to be forward. But I can’t find it in the two sequences you posted.
trim.seqs will expect the sequence to start with the barcode sequence and then the forward primer sequence if you give trim.seqs those values. If you used the gene primer to sequence then your primer would not appear in the sequence. If a separate index read was generated to get the barcode sequence, then that also wouldn’t show up in the fastq file.

Pat

jhkeet · May 29, 2020, 10:36am

Okay, the sequencing was done via an external service provider (MrDNA). We requested the following primers:

nodCFI2F CCGGATAGGMTGGKBCCRTA
nodCRI2R GTGCACAASGCRTADRCCTTCAH

I see that the “linker” is indeed the forward primer. Strange, since this is what the first few lines of their mapping file looks like:

#SampleID	BarcodeSequence	LinkerPrimerSequence	BarcodeName	ReversePrimer	ProjectName	Description
s1	GCTAGCCC	CCGGATAGGMTGGKBCCRTA	nodCFI2Fbar1	GTGCACAASGCRTADRCCTTCAH	040920SWnodC	s1
s2	GCTAGGTC	CCGGATAGGMTGGKBCCRTA	nodCFI2Fbar2	GTGCACAASGCRTADRCCTTCAH	040920SWnodC	s2
s3	GCTAGTAG	CCGGATAGGMTGGKBCCRTA	nodCFI2Fbar3	GTGCACAASGCRTADRCCTTCAH	040920SWnodC	s3
s4	CGAGGGCA	CCGGATAGGMTGGKBCCRTA	nodCFI2Fbar4	GTGCACAASGCRTADRCCTTCAH	040920SWnodC	s4

I have now replaced the reverse sequence with the forward sequence, and tried a couple variations again (i.e. increase bdiffs and remove qaverage), and still pretty much all the sequences get scrapped.

Hope that helps?

Oh, and when I use the forward primer instead, the errors are mostly of “f” and some “fb”, e.g. bf bdiffs=4(noMatch) fpdiffs=12(noMatch)

jhkeet · June 2, 2020, 11:31am

So just a follow up on this: I opened the original fasta file, and sure enough the barcode and forward primers are there, right next to each other in sequence (barcode in italic and forward in bold; first line given as an example):

M02542:32:000000000-J47WM:1:1101:18419:1298
CGAGGGCACCGGATAGGATGGGTCCATATCTGCGTCAACAATTGCGGTGGGCACGGAGTACCTTCCGGGATACAGCGCTTGCGCTACCCCTGCTGCCACGCCTCGACTTCTACATCACGCTGGACATTGCGGGGCAGAACCTCCTTCCGCTTCTGCTCGGCGTTTCCATACTGACCGCGCTTGCGCAAATCGCACTGACGAGCGAGTTGCCGTGGCCGACGGCACTGATCATCGCGTCAATGACCATGGTTCGCTGCAGCTTGGCAGCATTTCGTGCCCGACAGCTTCGCTTTCTCGCTTTCGCTCTACACAAACCGATTAGCATGTTCCTGTTACTTCCTGTGAAGGTTTACGCCTTGTGCAC

Moreover, when I completely remove the qual file from trim.seqs (which is obviously not the best idea), the function runs like clockwork, classifying everything.

So, when including the qual file, I wonder if mothur stops working due to giving too many errors? The reasoning for this is, that before it crashes (i.e. when I run the qual and fasta files both in trim.seqs), it actually does let some sequences pass (while scrapping some others). So maybe the amount of mismatches are just too many?

pschloss · June 2, 2020, 12:31pm

Sounds like progress, right? I still don’t see the barcode or forward primer in the two sequences you posted earlier, but it sounds like they were already processed in some way?

As for the quality score data, I suspect you’re losing them because qaverage=25 is too stringent. I think the MiSeq keeps sequences that have an average quality score of 20, so 25 might just be tossing everything. The qaverage is over the full length of the sequence. There are other options for dealing with quality score data in trim.seqs. You might try rollaverage (eg rollaverage=25) or use the trimming approach we describe in the 454 SOP. There we used qwindowaverage=35, qwindowsize=50. Keep in mind, I don’t know how this all will work since we’ve never worked with single reads (their quality is generally so poor that we never pursued anything but dual reads).

Pat

jhkeet · June 2, 2020, 3:18pm

RE your first part: Yes, I don’t know how that works. That fasta and qual were made from the fastq file, so presumably the fastq did not have that info… whereas the original full fasta does have that info. Not sure how that works…

Then for the rest: I even tried to drop the qaverage to 1, then also tried the rollaverage=25 as you mentioned, and even tried rollaverage =1, but each time it still bombs out with the same original errors, i.e. the one of “expected a number and got XXXX” as well as “sequence name mismatch btwn fasta and qual file”. So it is only when I completely remove the qual parameter that it goes its merry way.

There must some fundamental mismatch between the fasta and qual, that is all I can think of? Not sure how that could have happened though.

pschloss · June 2, 2020, 4:54pm

Are you starting with a fastq file or a fasta and qual file?

jhkeet · June 2, 2020, 5:36pm

With fasta and qual (but not mothur generated ones, they came separately with the fastq).

When I make fasta and qual files from the fastq (using fastq.info), is when there are so many barcode differences, as mentioned earlier.

pschloss · June 2, 2020, 8:41pm

You need to get the original fastq file from the sequencing vendor
Pat

system · June 12, 2020, 8:41pm

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
trim.seqs bugs mothur bugs	2	3213	May 9, 2011
Help please with a command o mothur Theory behind mothur	8	659	March 7, 2021
Analysis of Illumina data - problem with make.contigs Commands in mothur	10	9955	September 10, 2014
trim.seqs error for V1.18 mothur bugs	1	3159	May 27, 2011
Stumped on getting started Commands in mothur	1	3219	September 17, 2014

File error right from the start

Related topics