New problem running SOP

I’ve been trying to work my way through the SOP on an HMP mock dataset. Everything seems to work ok until:

mothur > pre.cluster(fasta=reads.shhh.trim.unique.good.filter.unique.fasta, name=reads.shhh.trim.unique.good.filter.names, group=reads.shhh.good.groups, diffs=2)

[ERROR]: Your name file contains 973 valid sequences, and your groupfile contains 1653, please correct.
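The mismatch mothur is reporting can be inspected by hand: a name file lists each representative plus a comma-separated list of its duplicates, while a group file maps every sequence name to a sample, and the two must cover exactly the same set of names. A minimal sketch of that check (the file contents below are toy data, not the real reads.* files):

```python
# Sketch: diagnose a mothur name/group count mismatch by comparing the
# sequence names each file knows about. Toy data, hypothetical names.

def names_in_name_file(lines):
    """All sequence names in a mothur name file: column 1 is the
    representative, column 2 a comma-separated list of duplicates."""
    names = set()
    for line in lines:
        if not line.strip():
            continue
        rep, dups = line.rstrip("\n").split("\t")
        names.update(dups.split(","))
    return names

def names_in_group_file(lines):
    """Sequence names in a mothur group file (name <tab> group)."""
    return {line.split("\t")[0] for line in lines if line.strip()}

name_file = ["seqA\tseqA,seqB\n", "seqC\tseqC\n"]
group_file = ["seqA\tsampleA\n", "seqB\tsampleA\n",
              "seqC\tsampleB\n", "seqD\tsampleB\n"]

n = names_in_name_file(name_file)
g = names_in_group_file(group_file)
print(len(n), len(g))                 # 3 4 -- the kind of mismatch mothur flags
print("only in group file:", g - n)   # {'seqD'}
```

Running the same comparison on the real .names and .groups files would show which sequences one file has that the other lacks, and usually points at the step where the two diverged.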

The commands prior to that were:

sffinfo(sff=reads.sff, flow=T)
trim.flows(flow=reads.flow, oligos=oligos.txt, pdiffs=2, bdiffs=1, processors=8, minflows=300, maxflows=300)
shhh.flows(file=reads.flow.files, processors=8)
trim.seqs(fasta=reads.shhh.fasta, name=reads.shhh.names, oligos=oligos.txt, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, flip=T, processors=8)
unique.seqs(fasta=reads.shhh.trim.fasta, name=reads.shhh.trim.names)
align.seqs(fasta=reads.shhh.trim.unique.fasta, reference=silva.bacteria.fasta, processors=8)
screen.seqs(fasta=reads.shhh.trim.unique.align, name=reads.shhh.trim.unique.names, group=reads.shhh.groups, end=27659, optimize=start, criteria=95, processors=8)
filter.seqs(fasta=reads.shhh.trim.unique.good.align, vertical=T, trump=., processors=8)
unique.seqs(fasta=reads.shhh.trim.unique.good.filter.fasta, name=reads.shhh.trim.unique.good.names)
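Since the file names above are where confusion tends to creep in: each mothur command appends a suffix to the input stem (.trim, .unique, .good, .filter in this pipeline), so the expected input for a later step can be derived mechanically. A toy sketch, using only the suffixes that appear in this thread (not an exhaustive list of mothur's conventions):

```python
# Sketch of the output-naming pattern in this pipeline: each command
# appends a suffix to the stem of its input. Suffixes taken from the
# commands above; this is an illustration, not mothur documentation.
SUFFIX = {
    "trim.seqs": "trim",
    "unique.seqs": "unique",
    "screen.seqs": "good",
    "filter.seqs": "filter",
}

def chain(stem, commands):
    """Derive the expected output stem after running the given commands."""
    for cmd in commands:
        stem += "." + SUFFIX[cmd]
    return stem

expected = chain("reads.shhh", ["trim.seqs", "unique.seqs", "screen.seqs",
                                "filter.seqs", "unique.seqs"])
print(expected + ".fasta")  # reads.shhh.trim.unique.good.filter.unique.fasta
```

That derived name matches the fasta passed to pre.cluster above, which is one way to spot-check that each step was fed the output of the previous one.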

Hmmm… Have you possibly gone back and run the commands in a different order than shown here? If this is the order, can you email me the following?


Also, what version of mothur are you running?

I’m kind of surprised that you get any sequences out of this at all. If you trim your flows to 300, your sequences are likely to be ~175 bp. Then in trim.seqs with minlength=200, you’ll probably scrap most of the sequences. That doesn’t explain the name/group mismatch, but it’s worth addressing at some point.
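The ~175 bp estimate follows from a rough bases-per-flow ratio; here is the arithmetic, treating ~0.58 bases/flow as an assumption inferred from the numbers in this thread rather than a 454 specification:

```python
# Back-of-envelope for the flow/length mismatch: ~0.58 bases per flow is
# inferred from this thread's "300 flows -> ~175 bp", not a 454 spec.
BASES_PER_FLOW = 175 / 300  # ~0.58

def expected_length(n_flows):
    """Rough expected read length (bp) after trimming to n_flows flows."""
    return n_flows * BASES_PER_FLOW

for flows in (300, 450):
    print(flows, "flows ->", round(expected_length(flows), 1), "bp")
# 300 flows -> 175.0 bp   (below the minlength=200 cutoff in trim.seqs)
# 450 flows -> 262.5 bp   (trim.flows' default, comfortably above it)
```

So with a 300-flow trim, nearly everything falls under the minlength=200 filter, while the 450-flow default clears it easily.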


Version is mothur v.1.27.0 (8/8/2012), 64-bit exe under Win7/Cygwin.

It looks to me like the error is most likely caused by my misunderstanding of the file naming conventions, something like an extra / spurious …v35… where there are files from a previous step both with and without the v35, but I’ve tried several variations without success. I removed the v13 & v69 primers from oligos.txt so that everything is v35 only as far as possible (this is actually the only region I need).

I posted the files needed to reproduce the problem here: these are archives of the directory before and after running mothur, so you can review the output files without waiting hours for the denoise step. The script is mothur_cmds.txt.

I used min/maxflows=300 because the SOP recommends 300-350 for GSFLX, which is the machine according to the SRA for these data (SRR053818), and I thought shorter would be more conservative, since presumably longer flow series are noisier because 454 Q scores drop with length. I know nothing about flowgrams, so I don’t understand how to set these parameters intelligently. Mothur thinks it is Titanium data because it reads the Titanium .pat file but none of the other .pat’s, so maybe there are conflicting annotations in the .sra file. I tried without min/maxflows, but this didn’t help.

Thanks again for your help, and sorry if this is an elementary mistake on my part.

Hmmm. Not sure what happened. Here’s what you ran…

sffinfo(sff=reads.sff, flow=T)
trim.flows(flow=reads.flow, oligos=oligos.txt, pdiffs=2, bdiffs=1, processors=8)
shhh.flows(file=reads.flow.files, processors=8)
trim.seqs(fasta=reads.v35.shhh.fasta, name=reads.v35.shhh.names, oligos=oligos.txt, pdiffs=2, bdiffs=1, maxhomop=8, minlength=200, flip=T, processors=8)
unique.seqs(fasta=reads.v35.shhh.trim.fasta, name=reads.v35.shhh.trim.names)
align.seqs(fasta=reads.v35.shhh.trim.unique.fasta, reference=silva.bacteria.fasta, processors=8)
screen.seqs(fasta=reads.v35.shhh.trim.unique.align, name=reads.v35.shhh.trim.unique.names, group=reads.v35.shhh.groups, end=27659, optimize=start, criteria=95, processors=8)
filter.seqs(fasta=reads.v35.shhh.trim.unique.good.align, vertical=T, trump=., processors=8)
unique.seqs(fasta=reads.v35.shhh.trim.unique.good.filter.fasta, name=reads.v35.shhh.trim.unique.good.names)
pre.cluster(fasta=reads.v35.shhh.trim.unique.good.filter.unique.fasta, name=reads.v35.shhh.trim.unique.good.filter.names, group=reads.v35.shhh.good.groups, diffs=2)

When I run it starting at trim.seqs, everything seems fine. However, what I get out of trim.seqs is different from what you got. Could you perhaps start over at that point and see how it goes?

A couple other things…

  • There’s nothing here to limit your min/maxflows to 300; by default trim.flows uses 450 flows. Perhaps you ran trim.flows with 300 flows once and have a mixture of files from one run and files from another? The fasta file is definitely from the 450 flows.
  • When we analyzed very similar data for the PLoS ONE paper, we didn’t see a benefit to going to lower numbers of flows.
  • These data are from Titanium chemistry. If you ever wanted to use the FLX chemistry, those *.pat files are available on the wiki.
  • I suspect running things under Cygwin is going to be much slower than Mac/Linux. For some reason (you probably know better than I do), the same code compiled for the different OSes always runs slower on Windows than the ’nixes.
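On the first point above, one way to tell which trim.flows setting produced a given shhh fasta is its read-length distribution: 450-flow reads run noticeably longer than 300-flow reads. A minimal sketch, using toy FASTA text rather than the real file:

```python
# Sketch: infer the trim.flows setting behind a fasta from read lengths.
# The fasta string below is toy data standing in for reads.shhh.fasta.

def read_lengths(fasta_text):
    """Lengths of the sequences in a FASTA string (headers start with '>')."""
    lengths, current = [], 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current:
                lengths.append(current)
            current = 0
        else:
            current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

fasta = ">a\n" + "A" * 260 + "\n>b\n" + "A" * 255 + "\n"
lens = read_lengths(fasta)
median = sorted(lens)[len(lens) // 2]
print(median)  # 260 -- lengths near ~260 bp point to 450 flows, ~175 bp to 300
```

Run against the actual denoised fasta, a median well above 200 bp would confirm it came from a 450-flow trim rather than 300.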


I’m 100% sure these are clean before-and-after snapshots; I triple-checked by downloading and running again from an empty directory. I also tried Linux, and it worked fine, so this looks like a bug in the Windows build, or at least a significant difference in behavior. FWIW, I use Windows as my primary development environment because I find Visual Studio to be by far the best C++ development environment on any platform. In my tests, the VS optimizing compiler is surprisingly good, almost as good as the Intel compiler. With MUSCLE, Intel is slightly better than VS or gcc; with USEARCH it’s a tie, and I get comparable performance between VS on Windows and icc & gcc on Linux.