Analysis of Illumina data - problem with make.contigs

Hi all,

I’m a beginner with the analysis of Illumina data and I’m stuck at the very first step. It’s probably something easy to fix, but I don’t know where to look, and as I’m leaving the lab in less than a month, I’m a little bit in a hurry. I checked on the forum but couldn’t find a solution, so my apologies if I missed it!

My supervisor wanted an analysis with only single-end data. I tried what was advised on the Mothur forum (cf. the thread “Illumina single-read with index in 2nd sequencing run”), but unfortunately I couldn’t get make.contigs to work. It never finds the paired read, even though the names are the same in the R1 file and in the rc-R1 file (a reverse complement of R1 that I created). I tried different parameters (reverse complementing the barcode, changing the mapping file, using rindex, findex, …) but the problem is always the same.

This is an example of the cmd I used:

make.contigs(ffastq=sequences.fastq, rfastq=sequences.rc.fastq, oligos=oligos.txt, rindex=index.fastq, processors=8)
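(For context: sequences.rc.fastq is just a reverse complement of the R1 reads. As a sketch of how such a file can be produced, assuming seqtk is installed, seqtk seq -r reverse complements each read and reverses its quality string accordingly:

seqtk seq -r sequences.fastq > sequences.rc.fastq

This is only one way to do it, not necessarily how the file used above was made.)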

So I then tried to use the R1 and R2 files instead of R1 plus the rc file that I created myself. I thought it would be easier, but not really. The error message at the end is: “Error reading quality file, name blank at position, -1”. The few sequences in the scrap files indicate a problem with the barcode.

This is an example of the cmd I used:
make.contigs(ffastq=sequences-R1.fastq, rfastq=sequences-R2.fastq, oligos=oligos.txt, rindex=index.fastq, processors=8)

And this is an example of the oligos file that I used (one of the numerous oligos files I tried):
#primer CCGGACTACHVGGGTWTCTAAT
BARCODE NONE ACTAGGATCAGT TM1.0
BARCODE NONE GCTCCTTAGAAG TM2.0
BARCODE NONE TCCCATTCCCAT TM3.0
BARCODE NONE TGGCGTCATTCG TM4.0
BARCODE NONE AATCCTCGGAGT MW1.0
BARCODE NONE CTGGACGCATTA MW2.0
BARCODE NONE ACCGATTAGGTA MW3.0
BARCODE NONE ATGTGCTGCTCG MW4.0


Pat Schloss advised me to use trim.seqs for the single-end analysis, but I don’t know how to incorporate my index file (again, maybe something really obvious, so I apologize in advance if that’s the case). So that’s where I am for the moment. You’ll probably need to see one file or another, so just ask :)

Thanks so much for your help!

Can we take a step back - what protocol was used to do your library generation and sequencing? Which region and how long are the reads?

They used the protocol from the Earth Microbiome Project.

In summary:

Thanks for the help, I really appreciate it :)

Steph

So the command you want is:

make.contigs(ffastq=sequences-R1.fastq, rfastq=sequences-R2.fastq, oligos=oligos.txt, rindex=index.fastq, processors=8)

Although you probably want to change the “-” characters to “.”
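If it helps, a minimal sketch of that rename on the command line (assuming the fastq files sit in the working directory; mv overwrites any existing file with the target name):

mv sequences-R1.fastq sequences.R1.fastq
mv sequences-R2.fastq sequences.R2.fastq

and then use ffastq=sequences.R1.fastq and rfastq=sequences.R2.fastq in the command above.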

Can you start over from the original compressed fastq files, decompress, and run this command and tell me what you get? You can also forward the files to me via dropbox, google, etc. and I can take a look [mothur.bugs@gmail.com].

Pat

Actually the first idea was to analyze single-end reads. But as I didn’t find a way to do it (I tried creating a reverse complement file of my R1 file, as advised on the forum), I tried with paired-end. The idea behind that was also to check whether my initial files were suitable for the analysis.

So if possible, I’d like to do a single-end analysis (R1 file and index file). But I still don’t know how to do that. My questions about the single-end analysis are:

  • Is it possible to use ‘trim.seqs’ with an index file?
  • Is there another possibility than creating a reverse complement file from the R1 file?

For the paired-end:
I ran the same command you advised. Here are the results:

make.contigs(ffastq=sequences-R1.fastq, rfastq=sequences-R2.fastq, oligos=oligos.txt, rindex=index.fastq, processors=8)

Using 8 processors.
Reading fastq data…
10000
20000

532000
533000
534000
534626
[ERROR]: 28471.num.temp is blank. Please correct.
[ERROR]: 28472.num.temp is blank. Please correct.
[ERROR]: 28473.num.temp is blank. Please correct.
[ERROR]: 28474.num.temp is blank. Please correct.
[ERROR]: 28475.num.temp is blank. Please correct.
[ERROR]: 28476.num.temp is blank. Please correct.
[ERROR]: 28477.num.temp is blank. Please correct.
Done.
It took 1241 secs to process 15009503 sequences.

Output File Names:
sequences-R1.trim.contigs.fasta
sequences-R1.scrap.contigs.fasta
sequences-R1.contigs.report
sequences-R1.contigs.groups

[WARNING]: your sequence names contained ‘:’. I changed them to ‘_’ to avoid problems in your downstream analysis.

summary.seqs(fasta=sequences-R1.trim.contigs.fasta)


Using 8 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 293 293 1 4 1
2.5%-tile: 1 293 293 1 4 1
25%-tile: 1 293 293 1 4 1
Median: 1 294 294 1 5 2
75%-tile: 1 296 296 1 5 3
97.5%-tile: 1 296 296 1 5 3
Maximum: 1 296 296 1 5 3
Mean: 1 294.333 294.333 1 4.66667

# of Seqs: 3

Output File Names:
sequences-R1.trim.contigs.summary

…so there are only 3 sequences in the trim file…

I also sent you the fastq files via WeTransfer, just in case. It’s still transferring at the moment, so hopefully you should receive them in a couple of hours. I don’t have enough space in my Dropbox (even when empty) and couldn’t think of another way to send you the files.

Thanks :slight_smile:

Hi Pat,

Did you receive the three files? I sent them via WeTransfer…


Cheers,

Steph

Steph,

I ran into this when another lab handed me a single-read fastq and a single index fastq; they said the other read “failed”.

I solved it with some linux brute force as follows:

  1. In mothur, use fastq.info to generate a .fasta and a .qual file for both the read fastq and the index fastq.

  2. In Linux, use sed to remove the > from the index files, both .qual and .fasta, as follows (I think; I can’t remember for sure):

sed 's/>//g' index.fasta > index.edited.fasta

  3. In Linux, use paste to “rebuild” your reads by tacking the barcodes onto the ends (do this for both .fasta and .qual):

paste -d '' read1.fasta index.edited.fasta > read1.barcoded.fasta

NOTE: the -d switch is followed by two single quotes. Paste normally adds a tab between the files, and you don’t want that. Check the result with the Linux command head read1.barcoded.fasta to make sure it pasted smoothly. If not, use paste without the -d '' and then use sed to delete the tabs (see also the sanity-check sketch after these steps).

  4. Now you basically have pyrosequencing-style reads, with barcodes on each read. In mothur, use trim.seqs to de-multiplex: trim.seqs(fasta=read1.barcoded.fasta,qfile=read1.barcoded.qual,oligos=your.barcodes,checkorient=t,qaverage=25,maxambig=0,bdiffs=1,minlength=150)
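As a sanity check (not part of Craig’s steps above), it is worth confirming before the paste that the read and index fasta files hold the same number of sequences, in the same order, for example:

grep -c '>' read1.fasta
grep -c '>' index.fasta
head -1 read1.fasta
head -1 index.fasta

The two counts should match and the first headers should refer to the same read. And if paste was run without -d '' and left tab characters behind, they can be stripped afterwards with something like tr -d '\t' < read1.barcoded.fasta > read1.barcoded.clean.fasta.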

Craig Nelson
craig.nelson@hawaii.edu

Hi Craig,

Thanks for the reply! I’m not a Linux expert and didn’t know how to do this. Thanks a lot :)

Sorry for the delay but I wanted to be sure that it was working before replying.

So your Linux commands worked and I obtained the files needed for trim.seqs. I just had a problem when running trim.seqs, but I think it’s due to the number of processors used:

  • When I used 1 processor, trim.seqs stopped during the night, in the middle of the file.
    The last lines of the output were just numbers.

  • When I used 8 processors, I got an error message and ended up with only 29 sequences:

1799580
1799584
Appending files from process 3236
Appending files from process 3237
Appending files from process 3238
Appending files from process 3239
Appending files from process 3240
Appending files from process 3241
Appending files from process 3242
[ERROR]: Could not open sequences-R1.barcoded.fasta3242.num.temp

Group count:
cMW2.1 29
Total of all groups is 29

Output File Names:
sequences-R1.barcoded.trim.fasta
sequences-R1.barcoded.scrap.fasta
sequences-R1.barcoded.trim.qual
sequences-R1.barcoded.scrap.qual
sequences-R1.barcoded.groups

I’m going to try with 2 processors, just in case, but do you have an idea of what the problem could be? Maybe it’s linked to my initial files (Pat, would you have an idea as well, if you’ve had a look at the files?).

Thanks a lot for your help. My project ends at the end of June, so I really appreciate the help :)

Hi,

Again, it’s not working :( I’m desperate (or the worst Mothur user in the world).

Here is the error message:


[ERROR]: has occurred in the QualityScores class function QualityScores. Please contact Pat Schloss at , and be sure to include the mothur.logFile with your inquiry.
Appending files from process 11951
[ERROR]: Could not open sequences-R1.barcoded.fasta11951.num.temp
[ERROR]: is in your fasta file more than once. Sequence names must be unique. please correct.
[ERROR]: is in your fasta file more than once. Sequence names must be unique. please correct.
[ERROR]: is in your fasta file more than once. Sequence names must be unique. please correct.
...

Group count:
 2048
cMW2.1 29
Total of all groups is 2077

Output File Names:
sequences-R1.barcoded.trim.fasta
sequences-R1.barcoded.scrap.fasta
sequences-R1.barcoded.trim.qual
sequences-R1.barcoded.scrap.qual
sequences-R1.barcoded.groups
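
(One thing that might help narrow down the duplicate-name errors, as a quick check: list any header that occurs more than once, or any empty ‘>’ line, in the barcoded fasta:

grep '^>' sequences-R1.barcoded.fasta | sort | uniq -d | head

If the paste step left blank or repeated headers behind, they should show up here.)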


Any idea? Thanks a lot :)

Hi everyone,


Time flies and I’m still stuck with my analysis :( If someone has an idea of what could help, please don’t hesitate. I was supposed to analyse the dataset as single-end reads, but seeing that I can’t find any solution and that it’s becoming urgent, I’m open to a paired-end solution now as well! :) Thanks a lot for your help!

So, I tried a solution explained on the forum: http://www.mothur.org/forum/viewtopic.php?f=3&t=3159&start=10#p9096

To do this as quickly as possible, I ran it with 5 of my barcodes rather than the entire oligos file.
Good news: no more error message.
Bad news: 0 sequences in the trim file; all the sequences are in the scrap file.

Have I done something wrong?

Here are the oligos file, the command, and what mothur wrote at the end of make.contigs:

BARCODE ACTAGGATCAGT NONE TM1.0
BARCODE GCTCCTTAGAAG NONE TM2.0
BARCODE TCCCATTCCCAT NONE TM3.0
BARCODE TGGCGTCATTCG NONE TM4.0
BARCODE AATCCTCGGAGT NONE MW1.0

mothur > make.contigs(ffastq=Undetermined_S0_L001_R1_001.fastq, rfastq=Undetermined_S0_L001_R2_001.fastq, oligos=oligos.test.txt, rindex=Undetermined_S0_L001_I1_001.fastq)

Using 1 processors.
Reading fastq data…
10000
20000
30000
40000

15009000
15009503
Done.
It took 40417 secs to process 15009503 sequences.


Output File Names:
Undetermined_S0_L001_R1_001.trim.contigs.fasta
Undetermined_S0_L001_R1_001.scrap.contigs.fasta
Undetermined_S0_L001_R1_001.contigs.report
Undetermined_S0_L001_R1_001.contigs.groups

[WARNING]: your sequence names contained ‘:’. I changed them to ‘_’ to avoid problems in your downstream analysis.
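
(One check that might help explain why everything landed in the scrap file: mothur notes the reason a read was rejected in the scrap output, as with the barcode problem seen earlier, so looking at the scrapped names can show whether the barcodes are still the issue, e.g.:

grep '^>' Undetermined_S0_L001_R1_001.scrap.contigs.fasta | head

If every read is flagged for its barcode, the barcodes in the oligos file probably still don’t match the orientation of the index reads.)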


Thanks again for your help :)

Your data was actually the dataset that I developed the workaround for in the other thread. I’ve emailed you the corrected oligos file.

Pat