mothur

File error right from the start

mothur v.1.44.1 running of Windows 10 64 bit

Hi all,

I am getting errors right from the very start of my pipeline. The Illumina files I received from the service provider are the typical run-of-the-mill fasta and qual files, and mapping files.

I open mothur, and run the following commands:

set.current(processors=1, fasta=040920SWnodC_full.fasta, qfile=040920SWnodC_full.qual, oligos=040920SWnodC.oligos)

trim.seqs(fasta=current, qfile=current, oligos=current, qaverage=25, pdiffs=2, bdiffs=2)

And then just errors all the way… specifically stuff like this (many of them):

“[ERROR]: In sequence M02542_32_000000000-J47WM_1_1101_15286_1199’s quality scores, expected a number and got >M02542:32:000000000-J47WM:1:1101:18259:1200, setting score to 0.”

and also (again, many of them)

“sequence name mismatch btwn fasta: M02542_32_000000000-J47WM_1_1101_18259_1200 and qual file: M02542_32_000000000-J47WM_1_1101_17414_1199”

On a previous version, mothur just kept on generating errors until I manually killed it. In the latest version it stops, and give the message:

"[ERROR]: std::bad_allocRAM used: 4.7353Gigabytes . Total Ram: 5.87916Gigabytes.

has occurred in the QualityScores class function trimQScores. This error indicates your computer is running out of memory. This is most commonly caused by trying to process a dataset too large, using multiple processors, or a file format issue. If you are using multiple processors, try running the command with processors=1, the more processors you use the more memory is required. Also, you may be able to reduce the size of your dataset by using the commands outlined in the Schloss SOP, http://www.mothur.org/wiki/Schloss_SOP. If you are unable to resolve the issue, please contact Pat Schloss at mothur.bugs@gmail.com, and be sure to include the mothur.logFile with your inquiry."

So: is this an issue with the raw files themselves, or some other issue? The files sizes are 1.5 GB (fasta) and 4 GB (qual). I am using a very new laptop. I have 8GB RAM. I guess it could be a performance issue, but I don’t get it since I have run similar (and larger) datasets on a laptop that was 8 years old! And with only 4GB RAM (which eventually crashed last year)

Thanks in advance.

Hi

I suspect something got out of whack with your fasta and qual files. Normally, you’ll get fastq files from an Illumina machine, which embeds both the sequence and quality score data. Can you get those instead of separate fasta and qual files? Also, I’m not sure what your downstream goals are, but we discourage quality trimming sequences before forming contigs. It actually makes the assembly worse than leaving them at their full length.

Thanks,
Pat

Okay, thanks for the help. We are going to contact them regarding the fastq files. Interestingly, this particular sequence provider has not given us fastq files in the previous two or three projects. So not sure what’s up with that…

Just a note: the gene region that was sequence is the short nodC gene region (nodulation gene for nitrogen fixers), so I think only the forward sequences were obtained. This would give no use for make.contigs, right? I see the fastq.info can generate fasta and qual files, maybe these will then match up correctly.

Right - if you only have reads in one direction then there’s no use for make.contigs and you’re right to use trim.seqs.

Pat

Just a quick question: I made separate fasta and qual files from the fastq file (with fastq.info), but upon using trim.seqs thereafter, literally all the sequences get chucked out as scrap (even with setting pdiff=2 and bdiff=2). Could this be indicative of a botched Illumina output?

JH

fastq.info(fastq=040920SWnodC_pr.fastq)
trim.seqs(fasta=040920SWnodC_pr.fasta, qfile=040920SWnodC_pr.qual, oligos=040920SWnodC.oligos, qaverage=25, pdiffs=2, bdiffs=2)

you might leave out the qaverage=25 argument and see how many more sequences get through.

Okay I tried a couple of combinations now (increasing bdiffs and pdiffs) as well as decreasing qaverage like you mention (even increasing them in the fastq.info command), until at least 1 sequence is not scrapped :frowning:

So I am adding just add a few scrap sequences so you can see. I am not sure what the error codes “k” and “kr” mean (they seem to dominate), they are not on the wiki as far as I can tell. Maybe you can say what they mean (there is a kb and kbr too). Also adding the code I used.

fastq.info(fastq=040920SWnodC_pr.fastq, pdiffs=3, bdiffs=3)
trim.seqs(fasta=040920SWnodC_pr.fasta, qfile=040920SWnodC_pr.qual, oligos=040920SWnodC.oligos, qaverage=0, pdiffs=3, bdiffs=3)
>a12__M02542_32_000000000-J47WM_1_2105_11617_11802 | k	bdiffs=3(match) fpdiffs=0(match) rpdiffs=0(match) 	
TCTGCGTCAACAATTGCGGTGGGCACGGAGTACCTTCCGGGATACAGCGCTTGCGCTACCCCTGCTGCCAAGCCTCGACTTCTACATCACGCTGGACATTGTCGGGCAGAACCTCCTGCCGCTTCTGCTCGGCGTTTCAATACTGACCGCGCTTGCGCAAATCGCACTGACGAGCGAGTTGCCGTGGCCGACGGTACTGATCATCGCGGCAATGACCATGGTTCGCTGCAGCTTGGCAGCATTTCGTGCCCGACAGCTTCGCTTTCTCGCTTTCGCTCTACACAAACCGATCAGCATGTTCCTGTTACTTCCTATGAAGGTCTACGCCTTGTGCAC
>a26__M02542_32_000000000-J47WM_1_1111_22862_14546 | kr	bdiffs=3(match) fpdiffs=0(match) rpdiffs=11(noMatch) 	
TCTGCGCCAACAACTCCGCTGGGCTCGAAGCACTTTCCGGGACACCCTGCTTGCGCTGCGTCTCCTGCCCAGTCTCGATCCTTATCTCACGTTGGACGTGGTCGGACAGAACCTCGGGCCCCTACTTCTCGCCCTCTCCGTAACCACAGCGATCGCACACCTCGTGTTGACTGGCGAAATGCCCTGGTGGACGGTCATCATGATTTTATCGATGACTATAATTCACTGCAGCATGTCAGCGTTGCGAGCTCGTCAACTGCGATTTCTAGGATTTTGTCTGCACACACCTATCAACCTCTTTCTTTTGCTTCCC


>a12__M02542_32_000000000-J47WM_1_2105_11617_11802	0.0269485
38 38 38 38 38 38 38 38 38 38 38 38 38 38 34 37 38 38 38 38 38 38 38 38 38 38 38 42 42 42 42 42 42 30 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 
>a26__M02542_32_000000000-J47WM_1_1111_22862_14546	0.025731
38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38 38

Can you post what your oligos file looks like?

Here it is:

linker CCGGATAGGMTGGKBCCRTA
reverse GTGCACAASGCRTADRCCTTCAH
barcode GCTAGCCC s1
barcode GCTAGGTC s2
barcode GCTAGTAG s3
barcode CGAGGGCA s4
barcode CGAGGTGG s5
barcode CGAGTAAA a1
barcode CGAGTAGA a2
barcode CGAGTCAT a3
barcode CGAGTCCA a4
barcode CGAGTGCT a5
barcode CGAGTTAT a6
barcode CGAGTTGA a7
barcode CGATACAG a8
barcode CGATAGAG a9
barcode CGATAGCG a10
barcode CGATAGGA a11
barcode CGATAGGT a12
barcode CGATCATT a13
barcode CGATGAGA a14
barcode CGATGGAT a15
barcode CGATGGCT a16
barcode CGATGTGA a17
barcode CGATTAAT a18
barcode CGATTACA a19
barcode CGATTGAA a20
barcode CGATTGTA a21
barcode CGATTGTG a22
barcode CGCAAATT a23
barcode CGCAACTA a24
barcode CGCACTCA a25
barcode CGCAGAAT a26
barcode CGCAGATG a27
barcode CGCAGCAT a28

So just a note: i added removed the linker part (added it in when I struggled with original fasta and qual mismatches), and ran it again with the normal parameters and the number of discarded sequences still seem extremely low…

mothur > 
trim.seqs(fasta=040920SWnodC_pr.fasta, qfile=040920SWnodC_pr.qual, oligos=040920SWnodC.oligos, qaverage=25, pdiffs=2, bdiffs=2)

Using 1 processors.
It took 951 secs to trim 1144634 sequences.

Group count: 
a1	49
a10	80
a11	71
a12	4
a13	2
a14	123
a15	20
a16	28
a17	123
a19	1
a2	57
a20	117
a21	14
a22	858
a23	679
a24	15
a25	4
a26	12
a27	57
a28	51
a4	82
a5	52
a6	10
a7	3
a8	18
a9	138
s1	1547
s2	268
s3	35
s4	162
s5	85
Total of all groups is 4765

Can you tell us how the sequencing was done? What primers were used to do the sequencing? Was there a separate index sequencing read?

A couple of things I can tell you…

  • You do not want to use reverse. In your oligos file. mothur will expect the sequence to end with the reverse complement of the reverse oligo. The first sequence you posted has the reverse primer in it. I suspect that’s because the sequencing read through the end of the read.
  • You likely don’t want the linker (that’s the k error code). Given the degeneracies in your linker sequence, I suspect that is really meant to be forward. But I can’t find it in the two sequences you posted.
  • trim.seqs will expect the sequence to start with the barcode sequence and then the forward primer sequence if you give trim.seqs those values. If you used the gene primer to sequence then your primer would not appear in the sequence. If a separate index read was generated to get the barcode sequence, then that also wouldn’t show up in the fastq file.

Pat

Okay, the sequencing was done via an external service provider (MrDNA). We requested the following primers:

nodCFI2F CCGGATAGGMTGGKBCCRTA
nodCRI2R GTGCACAASGCRTADRCCTTCAH

I see that the “linker” is indeed the forward primer. Strange, since this is what the first few lines of their mapping file looks like:

#SampleID BarcodeSequence LinkerPrimerSequence BarcodeName ReversePrimer ProjectName Description
s1 GCTAGCCC CCGGATAGGMTGGKBCCRTA nodCFI2Fbar1 GTGCACAASGCRTADRCCTTCAH 040920SWnodC s1
s2 GCTAGGTC CCGGATAGGMTGGKBCCRTA nodCFI2Fbar2 GTGCACAASGCRTADRCCTTCAH 040920SWnodC s2
s3 GCTAGTAG CCGGATAGGMTGGKBCCRTA nodCFI2Fbar3 GTGCACAASGCRTADRCCTTCAH 040920SWnodC s3
s4 CGAGGGCA CCGGATAGGMTGGKBCCRTA nodCFI2Fbar4 GTGCACAASGCRTADRCCTTCAH 040920SWnodC s4

I have now replaced the reverse sequence with the forward sequence, and tried a couple variations again (i.e. increase bdiffs and remove qaverage), and still pretty much all the sequences get scrapped.

Hope that helps?

Oh, and when I use the forward primer instead, the errors are mostly of “f” and some “fb”, e.g. bf bdiffs=4(noMatch) fpdiffs=12(noMatch)

So just a follow up on this: I opened the original fasta file, and sure enough the barcode and forward primers are there, right next to each other in sequence (barcode in italic and forward in bold; first line given as an example):

M02542:32:000000000-J47WM:1:1101:18419:1298
CGAGGGCACCGGATAGGATGGGTCCATATCTGCGTCAACAATTGCGGTGGGCACGGAGTACCTTCCGGGATACAGCGCTTGCGCTACCCCTGCTGCCACGCCTCGACTTCTACATCACGCTGGACATTGCGGGGCAGAACCTCCTTCCGCTTCTGCTCGGCGTTTCCATACTGACCGCGCTTGCGCAAATCGCACTGACGAGCGAGTTGCCGTGGCCGACGGCACTGATCATCGCGTCAATGACCATGGTTCGCTGCAGCTTGGCAGCATTTCGTGCCCGACAGCTTCGCTTTCTCGCTTTCGCTCTACACAAACCGATTAGCATGTTCCTGTTACTTCCTGTGAAGGTTTACGCCTTGTGCAC

Moreover, when I completely remove the qual file from trim.seqs (which is obviously not the best idea), the function runs like clockwork, classifying everything.

So, when including the qual file, I wonder if mothur stops working due to giving too many errors? The reasoning for this is, that before it crashes (i.e. when I run the qual and fasta files both in trim.seqs), it actually does let some sequences pass (while scrapping some others). So maybe the amount of mismatches are just too many?

Sounds like progress, right? I still don’t see the barcode or forward primer in the two sequences you posted earlier, but it sounds like they were already processed in some way?

As for the quality score data, I suspect you’re losing them because qaverage=25 is too stringent. I think the MiSeq keeps sequences that have an average quality score of 20, so 25 might just be tossing everything. The qaverage is over the full length of the sequence. There are other options for dealing with quality score data in trim.seqs. You might try rollaverage (eg rollaverage=25) or use the trimming approach we describe in the 454 SOP. There we used qwindowaverage=35, qwindowsize=50. Keep in mind, I don’t know how this all will work since we’ve never worked with single reads (their quality is generally so poor that we never pursued anything but dual reads).

Pat

RE your first part: Yes, I don’t know how that works. That fasta and qual were made from the fastq file, so presumably the fastq did not have that info… whereas the original full fasta does have that info. Not sure how that works…

Then for the rest: I even tried to drop the qaverage to 1, then also tried the rollaverage=25 as you mentioned, and even tried rollaverage =1, but each time it still bombs out with the same original errors, i.e. the one of “expected a number and got XXXX” as well as “sequence name mismatch btwn fasta and qual file”. So it is only when I completely remove the qual parameter that it goes its merry way.

There must some fundamental mismatch between the fasta and qual, that is all I can think of? Not sure how that could have happened though.

Are you starting with a fastq file or a fasta and qual file?

With fasta and qual (but not mothur generated ones, they came separately with the fastq).

When I make fasta and qual files from the fastq (using fastq.info), is when there are so many barcode differences, as mentioned earlier.

You need to get the original fastq file from the sequencing vendor
Pat