more contigs than raw sequences?!

I’m assembling a table that contains the number of raw sequences for each of my samples as well as how many sequences are removed during each quality control step in my pipeline. I’m having a really bizarre issue and I have no idea what to do. I really hope you all can help me out. For a large number of my samples (67/648) the number of raw sequences that I start out with is LOWER than the number of sequences that are assembled after making my contigs. Before I give additional details, here are a few things to note:

I am using the December 2014 version of mothur (1.34.3). When I try using the March or July versions, something goes wrong when I run make.contigs and mothur shuts down. I don’t have the terminal window up anymore, but I think it said Error 11. I also wanted to note that some of the files I am using have a low number of sequences (<10). I’m about to try running make.contigs without these files, but I’m not sure why they would be causing a problem. I also have multiple file pairs for each group. For example, in the file I provide for the make.contigs command I will have 5 pairs of forward and reverse files with the same group name. Finally, I am using multiple processors.

Here is the short list of things I’ve checked and troubleshot:

  1. I have confirmed the file used in make.contigs is correct. Every sample name and every corresponding forward and reverse sequence is as it should be.
  2. The original script I used to count the number of reads used the forward read file. I altered the script and confirmed that the reverse read file was generating the same number of raw sequences, and that these values are correct.
  3. I’ve opened up the fasta files for samples that are affected by this issue and they appear to be normal.
  4. Someone suggested that it might be something with my file names, as some of them contained dashes as well as underscores. I changed all of the dashes to underscores and reran all of my data. The output was identical, so this didn’t change anything.
  5. I’ve run a few small test batches containing just a few samples. It appears as if things work properly when I run them in smaller groups. Below is an example of the output I get for three different samples. raw.seqs is the number of raw sequences, data set is the number of sequences I get for those groups when I run make.contigs on the entire dataset, and test is what I get when I run make.contigs only on those three groups.

sample      raw.seqs   data set   test
T1G10       940300     940300     940300
T0_T1G10    156359     170409     156358
T7G7        1          88417      N/A (files were skipped b/c they were blank or had too few reads)
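
For anyone who wants to repeat the raw-count check from item 2, here is a minimal Python sketch of one way to count reads in a FASTQ file. It is not the script I actually used; it assumes plain, uncompressed FASTQ with exactly four lines per record, and the function name `count_fastq_records` is just a made-up helper.

```python
import io


def count_fastq_records(handle):
    """Count reads in an open FASTQ stream, assuming 4 lines per record."""
    # Ignore completely empty lines (e.g. a trailing newline at end of file).
    line_count = sum(1 for line in handle if line.strip())
    if line_count % 4 != 0:
        raise ValueError(
            f"{line_count} non-empty lines is not a multiple of 4; "
            "file may be truncated or not plain FASTQ"
        )
    return line_count // 4


# Tiny in-memory example (made-up reads, not real data).
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
print(count_fastq_records(example))
```

For real files you would open the forward and reverse files with `open(path)` and confirm that the two counts agree before comparing them with the make.contigs output.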


Any ideas? I'd be happy to provide any and all data files you'd like. As always, thanks for all the help!

Hi there,

Can you use the most recent version of mothur and try running it with only one processor?

Pat

Yes, I’ve started running it now using version 1.36.1 with one processor.

I reran it last night once again using version 1.36.1 with multiple processors. The error I was getting was Segmentation fault: 11. Not sure if I’ll get this error if I use one processor, but I guess I’ll know soon enough.

I got the same error, Segmentation fault: 11, when running it using version 1.36.1 with one processor.

Any ideas on how to proceed?

So I tried running my data again using the December 2014 version of mothur (1.34.3) with multiple processors. This time, I altered the file I provide to make.contigs so that every forward and reverse pair had a unique group name. I also went ahead and removed references to any files that were smaller than 1 KB.

The results were better in that only 36 rather than 67 of the total groups were affected, but I’m still having the same problem. What is also concerning is that only 18 of the 36 samples affected in this new run belong to total groups that were also affected in the run I described in my original post.

I’ve been looking at the readouts, and I’m seeing things like this:

total group   individual group   forward file                        raw     make.contigs
T5G1          T5G1_C             T5G1A_S143_L001_R1_001_3.fastq      5       501
T5G1          T5G1_D             T5G1_S121_L001_R1_001_1.fastq       501     1107
T5G1          T5G1_E             T5G1_S121_L001_R1_001_2.fastq       1107    77341
T5G10         T5G10_A            T5G10A_S140_L001_R1_001_1.fastq     77341   108959
T5G10         T5G10_C            T5G10_S128_L001_R1_001_1.fastq      71859   101354
T5G6          T5G6_A             T5G6-CD_S180_L001_R1_001_4.fastq    4       56806
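
To pull the affected groups out of a readout like this, a short Python sketch along these lines would work. The helper name `flag_inflated` is made up for illustration, and the numbers are copied from the rows above.

```python
def flag_inflated(rows):
    """Return group names whose make.contigs count exceeds the raw read count."""
    return [name for name, raw, contigs in rows if contigs > raw]


# (individual group name, raw reads, reads reported after make.contigs)
rows = [
    ("T5G1_C", 5, 501),
    ("T5G1_D", 501, 1107),
    ("T5G1_E", 1107, 77341),
    ("T5G10_A", 77341, 108959),
    ("T5G10_C", 71859, 101354),
    ("T5G6_A", 4, 56806),
]

for name in flag_inflated(rows):
    print(name)
```

Every row shown here gets flagged, since a pair of files cannot yield more contigs than it has raw reads.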

Help?

So I still have no idea what’s causing all the problems, but as a workaround I (or rather someone with a little more programming expertise than I have) created a batch script in Perl that runs make.contigs on each pair of .fastq files separately. It’s definitely a workaround, so I’d love to know when this bug gets figured out.
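
The actual script was written in Perl and isn't posted here, but the idea can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the mothur executable is at `./mothur`, it is called in command-line mode with a quoted `#command`, each pair is passed through the `ffastq` and `rfastq` options of make.contigs, and the reverse file names (`R2`) are hypothetical.

```python
def make_contigs_commands(pairs, mothur="./mothur"):
    """Build one mothur command-line invocation per forward/reverse pair."""
    commands = []
    for forward, reverse in pairs:
        # One processor per run, so each pair is handled in isolation.
        call = f"#make.contigs(ffastq={forward}, rfastq={reverse}, processors=1)"
        commands.append([mothur, call])
    return commands


# Forward names come from the readout above; reverse names are assumed.
pairs = [
    ("T5G1_S121_L001_R1_001_1.fastq", "T5G1_S121_L001_R2_001_1.fastq"),
    ("T5G10_S128_L001_R1_001_1.fastq", "T5G10_S128_L001_R2_001_1.fastq"),
]

for command in make_contigs_commands(pairs):
    print(" ".join(command))
    # To actually run it: import subprocess; subprocess.run(command, check=True)
```

Running each pair on its own is slower than a single make.contigs call over the whole dataset, but it keeps the counts for one pair from being mixed up with another's.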

Hi,
I want to help you resolve this issue completely and move forward with your analysis. Could you give the Mothur.cen_64.noOpt.zip version https://github.com/mothur/mothur/releases/tag/v1.36.1 a try? If that doesn’t work for you, can you email your log file to mothur.bugs@gmail.com?
Thanks,
Sarah

I can’t execute the Mothur.cen_64.noOpt.zip version. When I try to, it gives me this output: -bash: ./mothur: cannot execute binary file

If it’s relevant, I am running OSX version 10.9.5.

What else would you like me to do?

I’m sorry, I assumed you were running mothur on a Linux machine. The executable I suggested will not work on a Mac. Can you send your input files to mothur.bugs@gmail.com? If they are too large to email, you can use Dropbox to share them with me.

All the files take up ~50GB. I’m assuming you want them all, since this problem doesn’t seem to happen if I run them in smaller groups. Can I send you the files via FTP? Otherwise I’m working on compressing them and figuring out who in the lab has a large Dropbox limit.

That took longer than it needed to, but you should now have an e-mail giving you access to a shared folder on my Google Drive. I also sent you a separate e-mail with the direct link. Thank you for all the help and let me know if you need anything else from me!

Any updates on this?

Can you try running the command with our current version, 1.37.4? I am not able to reproduce the issues you are having. For example, when I run T3G10_C, mothur reports 37381 reads.

Just ran the command with version 1.37.6 and the raw sequence and make.contigs values now match. Not sure what the original problem was, but it appears to be fixed in the new version!

Thanks for the help!