I have tried to follow the Costello pipeline exactly using a single fasta file with all sequences and a groups file that partitions the input sequences into 3 groups (A,B,C).
I am at the read.otu() step and just cannot figure out where I have gone wrong. It would seem that I did not supply the group file somewhere, but I have repeated it 3 times now following the Costello wiki page on a second monitor and cannot figure out where. Plus, I actually downloaded the sample data and followed the Costello pipeline and that worked fine. Using the same script on my data fails at the read.otu() step…
First, the pipeline:
#!/bin/bash
INPUT="input.clean"
#
# input.clean.fasta and input.clean.groups
#
silvaBugs=/bank/silva/silva.bacteria/silva.bacteria.fasta
silvaGold=/bank/silva/silva.bacteria/silva.gold.align
ln -s $silvaBugs .
ln -s $silvaGold .
ncpu="processors=5"
mothur "#unique.seqs(fasta="$INPUT".fasta)"
mothur "#align.seqs(candidate="$INPUT".unique.fasta, template=silva.bacteria.fasta, "$ncpu")"
#mothur "#summary.seqs(fasta="$INPUT".unique.align)"
## Start End NBases Ambigs Polymer
##Minimum: 1044 1044 1 0 1
##2.5%-tile: 1044 5249 202 0 4
##25%-tile: 1044 6324 288 0 4
##Median: 1044 6421 323 0 5
##75%-tile: 1044 6421 333 0 5
##97.5%-tile: 2090 6443 360 0 6
##Maximum: 43107 43116 392 0 7
### of Seqs: 189245
mothur "#screen.seqs(fasta="$INPUT".unique.align, name="$INPUT".names, group="$INPUT".groups, "$ncpu", minlength=200, maxlength=400, start=1044, end=6421)"
mothur "#filter.seqs(fasta="$INPUT".unique.good.align-silva.gold.align, vertical=T, "$ncpu")"
mothur "#chimera.slayer(fasta="$INPUT".unique.good.filter.fasta, template=silva.gold.filter.fasta, "$ncpu")"
mothur "#remove.seqs(accnos="$INPUT".unique.good.filter.slayer.accnos, fasta="$INPUT".unique.good.filter.fasta, name="$INPUT".good.names, group="$INPUT".good.groups, dups=T)"
mothur "#filter.seqs(fasta="$INPUT".unique.good.filter.pick.fasta, vertical=T, trump=., "$ncpu")"
mothur "#unique.seqs(fasta="$INPUT".unique.good.filter.pick.filter.fasta, name="$INPUT".good.pick.names)"
mothur "#pre.cluster(fasta="$INPUT".unique.good.filter.pick.filter.unique.fasta, name="$INPUT".unique.good.filter.pick.filter.names, diffs=2)"
mothur "#dist.seqs(fasta="$INPUT".unique.good.filter.pick.filter.unique.precluster.fasta, output=lt, "$ncpu")"
mothur "#read.dist(phylip="$INPUT".unique.good.filter.pick.filter.unique.precluster.phylip.dist, name="$INPUT".unique.good.filter.pick.filter.unique.precluster.names, cutoff=0.20);cluster(method=average)"
mothur "#read.otu(list="$INPUT".unique.good.filter.pick.filter.unique.precluster.phylip.an.list, group="$INPUT".good.pick.editted.groups)"
The read.otu command complains:
Your group file contains 481797 sequences and list file contains 281518 sequences. Please correct.
For a list of names that are in your group file and not in your list file, please refer to input.clean.unique.good.filter.pick.filter.unique.precluster.phylip.an.missing.name.
Now here comes the question: why does read.otu think there are that many sequences in the list file, for me there are 573,092 IDs but only 67,306 unique IDs
awk '{print $3}' input.clean.unique.good.filter.pick.filter.unique.precluster.phylip.an.list | sed 's/,/\n/g' | wc
573092 573092 8596380
awk '{print $3}' input.clean.unique.good.filter.pick.filter.unique.precluster.phylip.an.list | sed 's/,/\n/g' | sort -u | wc
67306 67306 1009590
I wrote a program that reads in the list file and makes a unique list of the IDs and then reads the group file and only outputs lines with an ID from the list file, but that gives me a list file with 67,306 IDs in it and read.otu() wont run with that either.
And yes, my group file is ID{tab}A|B|C formatted properly. Meaning I only have two tokens, separated by a tab, on each line, the ID and then the group letter A, B or C.
Thanks for reading!