Hi all,
I would be extremely grateful if someone could give me a hint on what’s going wrong here – maybe I’m just completely barking up the wrong tree… :?
For working with large datasets is sometimes seems opportune to split sample groups and re-merge when necessary. Doing that I often get the dreaded sequence mismatch error ([ERROR]: Your count table contains more than 1 sequence named XXXXXXXXXXXXXXXXXX, sequence names must be unique. Please correct.). I realise that in the majority of cases this is caused by some mistake in file naming or group name mix-ups.
Nevertheless, below an example where I reproduced the sequence mismatch error from a test with only two groups (samples) which I processed in the usual SOP aligned way, then split and tried to merge again – to no avail.
The sequence output came from a commercial provider, run on a Hi-Seq and are environmental fungal ITS sequences (non aligned).
In brief, sequences were put through a standard mothur pipeline and after some basic clean-up split with split.groups and alternatively also with get.groups, the single group output was verified for integrity with summary.seqs and unique.seqs and used as input for merge.groups (which failed every time).
I noticed one strange thing: when running make.contigs with just one group in compressed gz file format, the module reproducibly crashed with a seg fault:” is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.” This didn’t occur if I decompressed the fastq.gz files manually or if the gz files were fed in by a three column list file. Just to test if this strange behaviour might be related to the split/merge problem, I run the test with standard contiged files and also with the same files but contiged in a somewhat elaborate way: fastq.info > make.contigs per sample with fasta & qual file input > make.group & merge.files – same final result (error). From the fastq.info output I also compared the fasta identifiers first 500 sequences of the two groups – no identical sequence identifiers (is there a good tool to do this for entire large files?).
Again, any help would be greatly appreciated!!
Tom
testing make.contigs
with gz compressed files
make.contigs(ffastq=891345-lib232126_R1.fastq.gz, rfastq=891345-lib232126_R2.fastq.gz)
Seg fault
is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
make.contigs(ffastq=891355-lib232093_R1.fastq.gz, rfastq=891355-lib232093_R2.fastq.gz)
Seg fault
is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
with decompressed files
gunzip *.fastq.gz
make.contigs(ffastq=891345_lib232126_R1.fastq, rfastq=891345_lib232126_R2.fastq)
Output File Names:
891345_lib232126_R1.trim.contigs.fasta
891345_lib232126_R1.scrap.contigs.fasta
891345_lib232126_R1.trim.contigs.qual
891345_lib232126_R1.scrap.contigs.qual
891345_lib232126_R1.contigs.report
make.contigs(ffastq=891355_lib232093_R1.fastq, rfastq=891355_lib232093_R2.fastq)
Output File Names:
891355_lib232093_R1.trim.contigs.fasta
891355_lib232093_R1.scrap.contigs.fasta
891355_lib232093_R1.trim.contigs.qual
891355_lib232093_R1.scrap.contigs.qual
891355_lib232093_R1.contigs.report
Recompessed:
make.contigs(ffastq=891355_lib232093_R1.fastq.gz, rfastq=891355_lib232093_R2.fastq.gz)
‹±«.Z
is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
Seg fault
getting fastq.info output
fastq.info(fastq=891345_lib232126_R1.fastq)
891345_lib232126_R1.qual
891345_lib232126_R1.fasta
fastq.info(fastq=891345_lib232126_R2.fastq)
891345_lib232126_R2.qual
891345_lib232126_R2.fasta
fastq.info(fastq=891355_lib232093_R1.fastq)
891355_lib232093_R1.qual
891355_lib232093_R1.fasta
fastq.info(fastq=891355_lib232093_R2.fastq)
891355_lib232093_R2.qual
891355_lib232093_R2.fasta
system(head -n 1000 891345_lib232126_R1.fasta > head.891345.R1.fasta)
system(head -n 1000 891355_lib232093_R1.fasta > head.891355.R1.fasta)
Compared the first 500 fasta headers (sequence names) of 891345 & 891355 = No identicals
system(gzip *.fastq)
#standard make.contigs and cleanup
test.files
891355 891355_lib232093_R1.fastq.gz 891355_lib232093_R2.fastq.gz
891345 891345_lib232126_R1.fastq.gz 891345_lib232126_R2.fastq.gz
make.contigs(file=test.files, processors=32)
summary.seqs(fasta=test.trim.contigs.fasta, processors=32)
screen.seqs(fasta=test.trim.contigs.fasta, maxambig=0, maxhomop=8, minlength=308, maxlength=445, processors=32, group=test.contigs.groups)
summary.seqs(fasta=test.trim.contigs.good.fasta)
unique.seqs(fasta=test.trim.contigs.good.fasta)
count.seqs(name=test.trim.contigs.good.names, group=test.contigs.good.groups)
summary.seqs(count=test.trim.contigs.good.count_table)
chimera.vsearch(fasta=test.trim.contigs.good.unique.fasta, count=test.trim.contigs.good.count_table)
remove.seqs(fasta=test.trim.contigs.good.unique.fasta, count=test.trim.contigs.good.count_table, accnos=test.trim.contigs.good.unique.denovo.vsearch.accnos)
#checking integrity
summary.seqs(count=test.trim.contigs.good.pick.count_table, fasta=test.trim.contigs.good.unique.pick.fasta)
Start End NBases Ambigs Polymer NumSeqs Minimum: 1 308 308 0 3 1 2.5%-tile: 1 308 308 0 4 8027 25%-tile: 1 316 316 0 5 80270 Median: 1 322 322 0 6 160539 75%-tile: 1 323 323 0 6 240808 97.5%-tile: 1 370 370 0 7 313051 Maximum: 1 442 442 0 8 321077 Mean: 1 322.905 322.905 0 5.56995 # of unique seqs: 251368 total # of seqs: 321077
count.groups(count=test.trim.contigs.good.pick.count_table) 891345 contains 174348. 891355 contains 146729.
Total seqs: 321077.
unique.seqs(count=test.trim.contigs.good.pick.count_table, fasta=test.trim.contigs.good.unique.pick.fasta)
No new uniques but carry on with re-uniqued files
251368 251368
Output File Names:
test.trim.contigs.good.unique.pick.count_table
test.trim.contigs.good.unique.pick.unique.fasta
summary.seqs(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta)
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 308 308 0 3 1
2.5%-tile: 1 308 308 0 4 8027
25%-tile: 1 316 316 0 5 80270
Median: 1 322 322 0 6 160539
75%-tile: 1 323 323 0 6 240808
97.5%-tile: 1 370 370 0 7 313051
Maximum: 1 442 442 0 8 321077
Mean: 1 322.905 322.905 0 5.56995
of unique seqs: 251368
total # of seqs: 321077
input for split and merge test
split.groups(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta)
test.trim.contigs.good.unique.pick.unique.891345.fasta
test.trim.contigs.good.unique.pick.891345.count_table
test.trim.contigs.good.unique.pick.unique.891355.fasta
test.trim.contigs.good.unique.pick.891355.count_table
unique.seqs(count=test.trim.contigs.good.unique.pick.891345.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891345.fasta)
No new uniques
140645 140645
Output File Names:
test.trim.contigs.good.unique.pick.unique.891345.count_table
test.trim.contigs.good.unique.pick.unique.891345.unique.fasta
summary.seqs(count=test.trim.contigs.good.unique.pick.unique.891345.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891345.unique.fasta)
#good
unique.seqs(count=test.trim.contigs.good.unique.pick.891355.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891355.fasta)
No new uniques
114485 114485
Output File Names:
test.trim.contigs.good.unique.pick.unique.891355.count_table
test.trim.contigs.good.unique.pick.unique.891355.unique.fasta
summary.seqs(count=test.trim.contigs.good.unique.pick.unique.891355.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891355.unique.fasta)
#good
merge.count(count=test.trim.contigs.good.unique.pick.unique.891355.count_table-test.trim.contigs.good.unique.pick.unique.891345.count_table, output=test.merge.count_table)
[ERROR]: Your count table contains more than 1 sequence named D00259_626_H2KTHBCX2_2_1101_10097_41217, sequence names must be unique. Please correct.
alternative way to split groups
get.groups(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta, groups=891355)
test.trim.contigs.good.unique.pick.unique.pick.fasta
test.trim.contigs.good.unique.pick.pick.count_table
system(mv test.trim.contigs.good.unique.pick.unique.pick.fasta test.get.891355.fasta)
system(mv test.trim.contigs.good.unique.pick.pick.count_table test.get.891355.count_table)
get.groups(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta, groups=891345)
system(mv test.trim.contigs.good.unique.pick.unique.pick.fasta test.get.891345.fasta)
system(mv test.trim.contigs.good.unique.pick.pick.count_table test.get.891345.count_table)
merge.count(count=test.get.891345.count_table-test.get.891355.count_table, output=test.merge2.count_table)
[ERROR]: Your count table contains more than 1 sequence named D00259_626_H2KTHBCX2_2_1101_10097_41217, sequence names must be unique. Please correct.
#alternative make.contigs from fastq.info output
891345_lib232126_R2.qual
891345_lib232126_R2.fasta
891345_lib232126_R1.qual
891345_lib232126_R1.fasta
891355_lib232093_R2.qual
891355_lib232093_R2.fasta
891355_lib232093_R1.qual
891355_lib232093_R1.fasta
make.contigs(ffasta=891345_lib232126_R1.fasta, rfasta=891345_lib232126_R2.fasta, fqfile=891345_lib232126_R1.qual, rqfile=891345_lib232126_R2.qual)
Output File Names:
891345_lib232126_R1.trim.contigs.fasta
891345_lib232126_R1.scrap.contigs.fasta
891345_lib232126_R1.trim.contigs.qual
891345_lib232126_R1.scrap.contigs.qual
891345_lib232126_R1.contigs.report
make.contigs(ffasta=891355_lib232093_R1.fasta, rfasta=891355_lib232093_R2.fasta, fqfile=891355_lib232093_R1.qual, rqfile=891355_lib232093_R2.qual, processors=56)
Output File Names:
891355_lib232093_R1.trim.contigs.fasta
891355_lib232093_R1.scrap.contigs.fasta
891355_lib232093_R1.trim.contigs.qual
891355_lib232093_R1.scrap.contigs.qual
891355_lib232093_R1.contigs.report
make.group(fasta=891355_lib232093_R1.trim.contigs.fasta-891345_lib232126_R1.trim.contigs.fasta, groups=891355-891345)
Output File Names: 891355_lib232093_R1.trim.contigs.891345_lib232126_R1.trim.contigs.groups
system(mv 891355_lib232093_R1.trim.contigs.891345_lib232126_R1.trim.contigs.groups test2.contigs.groups)
merge.files(input=891345_lib232126_R1.trim.contigs.fasta-891355_lib232093_R1.trim.contigs.fasta, output=test2.trim.contigs.fasta)
summary.seqs(fasta=test2.trim.contigs.fasta, processors=56)
screen.seqs(fasta=test2.trim.contigs.fasta, maxambig=0, maxhomop=8, minlength=308, maxlength=445, processors=32, group=test2.contigs.groups)
summary.seqs(fasta=test2.trim.contigs.good.fasta)
unique.seqs(fasta=test2.trim.contigs.good.fasta)
count.seqs(name=test2.trim.contigs.good.names, group=test2.contigs.good.groups)
summary.seqs(count=test2.trim.contigs.good.count_table)
chimera.vsearch(fasta=test2.trim.contigs.good.unique.fasta, count=test2.trim.contigs.good.count_table)
remove.seqs(fasta=test2.trim.contigs.good.unique.fasta, count=test2.trim.contigs.good.count_table, accnos=test2.trim.contigs.good.unique.denovo.vsearch.accnos)
summary.seqs(count=test2.trim.contigs.good.pick.count_table, fasta=test2.trim.contigs.good.unique.pick.fasta)
Start End NBases Ambigs Polymer NumSeqs Minimum: 1 308 308 0 3 1 2.5%-tile: 1 308 308 0 4 8028 25%-tile: 1 316 316 0 5 80274 Median: 1 322 322 0 6 160547 75%-tile: 1 323 323 0 6 240820 97.5%-tile: 1 370 370 0 7 313065 Maximum: 1 442 442 0 8 321092 Mean: 1 322.904 322.904 0 5.56999 # of unique seqs: 251374 total # of seqs: 321092
unique.seqs(count=test2.trim.contigs.good.pick.count_table, fasta=test2.trim.contigs.good.unique.pick.fasta)
No new uniques
unique.seqs(count=test2.trim.contigs.good.unique.pick.891345.count_table, fasta=test2.trim.contigs.good.unique.pick.unique.891345.fasta)
unique.seqs(count=test2.trim.contigs.good.unique.pick.891355.count_table, fasta=test2.trim.contigs.good.unique.pick.unique.891355.fasta)
merge.count(count=test2.trim.contigs.good.unique.pick.unique.891355.count_table-test2.trim.contigs.good.unique.pick.unique.891345.count_table, output=test2.merge.count_table)
[ERROR]: Your count table contains more than 1 sequence named D00259_626_H2KTHBCX2_2_1101_10010_83780, sequence names must be unique. Please correct.