groups file out of sync with Costello pipeline

I have tried to follow the Costello pipeline exactly using a single fasta file with all sequences and a groups file that partitions the input sequences into 3 groups (A,B,C).

I am at the read.otu() step and just cannot figure out where I have gone wrong. It would seem that I did not supply the group file somewhere, but I have repeated it 3 times now following the Costello wiki page on a second monitor and cannot figure out where. Plus, I actually downloaded the sample data and followed the Costello pipeline and that worked fine. Using the same script on my data fails at the read.otu() step…

First, the pipeline:

#!/bin/bash

INPUT="input.clean"
#
#       input.clean.fasta and input.clean.groups
#
silvaBugs=/bank/silva/silva.bacteria/silva.bacteria.fasta
silvaGold=/bank/silva/silva.bacteria/silva.gold.align
ln -s $silvaBugs .
ln -s $silvaGold .

ncpu="processors=5"
mothur "#unique.seqs(fasta="$INPUT".fasta)"
mothur "#align.seqs(candidate="$INPUT".unique.fasta, template=silva.bacteria.fasta, "$ncpu")"
#mothur "#summary.seqs(fasta="$INPUT".unique.align)"
##              Start   End     NBases  Ambigs  Polymer
##Minimum:      1044    1044    1       0       1
##2.5%-tile:    1044    5249    202     0       4
##25%-tile:     1044    6324    288     0       4
##Median:       1044    6421    323     0       5
##75%-tile:     1044    6421    333     0       5
##97.5%-tile:   2090    6443    360     0       6
##Maximum:      43107   43116   392     0       7
### of Seqs:    189245

mothur "#screen.seqs(fasta="$INPUT".unique.align, name="$INPUT".names, group="$INPUT".groups, "$ncpu", minlength=200, maxlength=400, start=1044, end=6421)"
mothur "#filter.seqs(fasta="$INPUT".unique.good.align-silva.gold.align, vertical=T, "$ncpu")"
mothur "#chimera.slayer(fasta="$INPUT".unique.good.filter.fasta, template=silva.gold.filter.fasta, "$ncpu")"
mothur "#remove.seqs(accnos="$INPUT".unique.good.filter.slayer.accnos, fasta="$INPUT".unique.good.filter.fasta, name="$INPUT".good.names, group="$INPUT".good.groups, dups=T)"
mothur "#filter.seqs(fasta="$INPUT".unique.good.filter.pick.fasta, vertical=T, trump=., "$ncpu")"
mothur "#unique.seqs(fasta="$INPUT".unique.good.filter.pick.filter.fasta, name="$INPUT".good.pick.names)"
mothur "#pre.cluster(fasta="$INPUT".unique.good.filter.pick.filter.unique.fasta, name="$INPUT".unique.good.filter.pick.filter.names, diffs=2)"
mothur "#dist.seqs(fasta="$INPUT".unique.good.filter.pick.filter.unique.precluster.fasta, output=lt, "$ncpu")"
mothur "#read.dist(phylip="$INPUT".unique.good.filter.pick.filter.unique.precluster.phylip.dist, name="$INPUT".unique.good.filter.pick.filter.unique.precluster.names, cutoff=0.20);cluster(method=average)"
mothur "#read.otu(list="$INPUT".unique.good.filter.pick.filter.unique.precluster.phylip.an.list, group="$INPUT".good.pick.editted.groups)"

The read.otu command complains:

Your group file contains 481797 sequences and list file contains 281518 sequences. Please correct.
For a list of names that are in your group file and not in your list file, please refer to input.clean.unique.good.filter.pick.filter.unique.precluster.phylip.an.missing.name.

Now here comes the question: why does read.otu think there are that many sequences in the list file, for me there are 573,092 IDs but only 67,306 unique IDs

awk '{print $3}' input.clean.unique.good.filter.pick.filter.unique.precluster.phylip.an.list | sed 's/,/\n/g' | wc
 573092  573092 8596380
awk '{print $3}' input.clean.unique.good.filter.pick.filter.unique.precluster.phylip.an.list | sed 's/,/\n/g' | sort -u | wc
  67306   67306 1009590

I wrote a program that reads in the list file and makes a unique list of the IDs and then reads the group file and only outputs lines with an ID from the list file, but that gives me a list file with 67,306 IDs in it and read.otu() wont run with that either.

And yes, my group file is ID{tab}A|B|C formatted properly. Meaning I only have two tokens, separated by a tab, on each line, the ID and then the group letter A, B or C.

Thanks for reading!

Are your sequence names unique? If you send your files to mothur.bugs@gmail.com, I will take a look.

Any updates on that issue? I’m using Costello pipeline as well and I’m having similar issues.

freesci,
we’re still waiting on the files from jiaco. if you’re having the same issue with v.1.17.1, please feel free to send us the files you’re struggling with and we can try to track this down - mothur.bugs at gmail

Pat

I’ve finally tracked down the problem. Group file that goes down the pipeline should be created based on IDs from *.trim.fasta file, not from original FASTA file. I assumed that if I make the analysis from the scratch, I should create group file by myself with all IDs in it.

Sorry for troubling and complaining about my mistake.

I’m using v1.20.3 and following the Costello analysis. As I understand it the trim.seqs command early in the analysis generates the *.groups file and no other commands between it and screen.seqs make another group file. The screen.seqs command produces errors saying that sequence name XXXX is not in the *.groups file. But when I search for XXXX in the *.groups file it is there. I am not sure if freesci’s solution will fix this problem, and if it will, I do not know how to generate the correct groups file so that I can finish my analysis. Any help with this will be much appreciated.

You shouldn’t need any ancillary code if you’re starting from trim.seqs. Can you post a copy of the exact commands you are entering with the exact file names?

Pat

Copied from my mothur log file:

trim.seqs(fasta=AGGAACCA.seqs, oligos=july.oligos, qfile=AGGAACCA.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50, processors=2)

unique.seqs(fasta=AGGAACCA.trim.fasta)

align.seqs(fasta=AGGAACCA.trim.unique.fasta, reference=silva.bacteria.fasta, flip=t, processors=2)

screen.seqs(fasta=AGGAACCA.trim.unique.align, name=AGGAACCA.trim.names, group=AGGAACCA.groups, minlength=100, end=13125)

Here is the error text:

“Processing sequence: 975
Your groupfile does not include the sequence G4PJNUV02F07X3 please correct.
Your groupfile does not include the sequence G4PJNUV02F26F5 please correct.”

It repeats this for about 1000 seqeunces then reports:

"Output File Names:
AGGAACCA.trim.unique.good.align
AGGAACCA.trim.unique.bad.accnos
AGGAACCA.trim.good.names
AGGAACCA.good.groups


It took 6 secs to screen 1950 sequences."

I checked several of the names it has listed as being not in the groupfile (G4PJNUV02F07X3…etc) each of them are in the AGGAACCA.groups file which was called in the screen.seqs command. Did I miss something in the Costello analysis that would have generated a groups file the screen.seqs command can work with? Am I using the wrong names file somehow? I am still learning mothur and I’m sorry if it’s just a beginner’s mistake. Thank you for your help!

hmmm… weird. could you possibly try emailing us AGGAACCA.seqs, AGGAACCA.qual, and july.oligos (mothur.bugs@gmail.com)?

pat

I’m having the same problem with my groupfile not including the sequences (which are there), any solution?

umjdf - please email your files to the mothur.bugs email address.

Total thread necro, but just had the same problem again and google led me to my own post, which I thought I should amend to include something possibly useful to others.

Not sure if this solution is good for all, but in my case(s), I enter into the SOP just prior to Silva alignment. I make my names and groups files external to mothur. The problem is that my groups file has lots of sequences IDs that are not in my input fasta file and while the screen.seqs() before filter.seqs() takes the group file without complaint, the pre.cluster() program will balk at the inconsistencies.

So if anyone else stumbles here, be sure to clean up the groups file to only include IDs that are in the names file (and thus in the fasta file) that is being used.