mothur

splitting merging II


#1

Hi all,

I would be extremely grateful if someone could give me a hint on what’s going wrong here – maybe I’m just completely barking up the wrong tree… :?

For working with large datasets is sometimes seems opportune to split sample groups and re-merge when necessary. Doing that I often get the dreaded sequence mismatch error ([ERROR]: Your count table contains more than 1 sequence named XXXXXXXXXXXXXXXXXX, sequence names must be unique. Please correct.). I realise that in the majority of cases this is caused by some mistake in file naming or group name mix-ups.

Nevertheless, below an example where I reproduced the sequence mismatch error from a test with only two groups (samples) which I processed in the usual SOP aligned way, then split and tried to merge again – to no avail.

The sequence output came from a commercial provider, run on a Hi-Seq and are environmental fungal ITS sequences (non aligned).
In brief, sequences were put through a standard mothur pipeline and after some basic clean-up split with split.groups and alternatively also with get.groups, the single group output was verified for integrity with summary.seqs and unique.seqs and used as input for merge.groups (which failed every time).

I noticed one strange thing: when running make.contigs with just one group in compressed gz file format, the module reproducibly crashed with a seg fault:” is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.” This didn’t occur if I decompressed the fastq.gz files manually or if the gz files were fed in by a three column list file. Just to test if this strange behaviour might be related to the split/merge problem, I run the test with standard contiged files and also with the same files but contiged in a somewhat elaborate way: fastq.info > make.contigs per sample with fasta & qual file input > make.group & merge.files – same final result (error). From the fastq.info output I also compared the fasta identifiers first 500 sequences of the two groups – no identical sequence identifiers (is there a good tool to do this for entire large files?).


Again, any help would be greatly appreciated!!

Tom



testing make.contigs

with gz compressed files

make.contigs(ffastq=891345-lib232126_R1.fastq.gz, rfastq=891345-lib232126_R2.fastq.gz)
Seg fault
is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.

make.contigs(ffastq=891355-lib232093_R1.fastq.gz, rfastq=891355-lib232093_R2.fastq.gz)
Seg fault
is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.

with decompressed files

gunzip *.fastq.gz

make.contigs(ffastq=891345_lib232126_R1.fastq, rfastq=891345_lib232126_R2.fastq)
Output File Names:
891345_lib232126_R1.trim.contigs.fasta
891345_lib232126_R1.scrap.contigs.fasta
891345_lib232126_R1.trim.contigs.qual
891345_lib232126_R1.scrap.contigs.qual
891345_lib232126_R1.contigs.report

make.contigs(ffastq=891355_lib232093_R1.fastq, rfastq=891355_lib232093_R2.fastq)
Output File Names:
891355_lib232093_R1.trim.contigs.fasta
891355_lib232093_R1.scrap.contigs.fasta
891355_lib232093_R1.trim.contigs.qual
891355_lib232093_R1.scrap.contigs.qual
891355_lib232093_R1.contigs.report

Recompessed:

make.contigs(ffastq=891355_lib232093_R1.fastq.gz, rfastq=891355_lib232093_R2.fastq.gz)
‹±«.Z
is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
Seg fault

getting fastq.info output

fastq.info(fastq=891345_lib232126_R1.fastq)
891345_lib232126_R1.qual
891345_lib232126_R1.fasta

fastq.info(fastq=891345_lib232126_R2.fastq)
891345_lib232126_R2.qual
891345_lib232126_R2.fasta

fastq.info(fastq=891355_lib232093_R1.fastq)
891355_lib232093_R1.qual
891355_lib232093_R1.fasta

fastq.info(fastq=891355_lib232093_R2.fastq)
891355_lib232093_R2.qual
891355_lib232093_R2.fasta

system(head -n 1000 891345_lib232126_R1.fasta > head.891345.R1.fasta)
system(head -n 1000 891355_lib232093_R1.fasta > head.891355.R1.fasta)

Compared the first 500 fasta headers (sequence names) of 891345 & 891355 = No identicals

system(gzip *.fastq)

#standard make.contigs and cleanup

test.files

891355 891355_lib232093_R1.fastq.gz 891355_lib232093_R2.fastq.gz
891345 891345_lib232126_R1.fastq.gz 891345_lib232126_R2.fastq.gz

make.contigs(file=test.files, processors=32)

summary.seqs(fasta=test.trim.contigs.fasta, processors=32)
screen.seqs(fasta=test.trim.contigs.fasta, maxambig=0, maxhomop=8, minlength=308, maxlength=445, processors=32, group=test.contigs.groups)
summary.seqs(fasta=test.trim.contigs.good.fasta)
unique.seqs(fasta=test.trim.contigs.good.fasta)
count.seqs(name=test.trim.contigs.good.names, group=test.contigs.good.groups)
summary.seqs(count=test.trim.contigs.good.count_table)
chimera.vsearch(fasta=test.trim.contigs.good.unique.fasta, count=test.trim.contigs.good.count_table)
remove.seqs(fasta=test.trim.contigs.good.unique.fasta, count=test.trim.contigs.good.count_table, accnos=test.trim.contigs.good.unique.denovo.vsearch.accnos)

#checking integrity

summary.seqs(count=test.trim.contigs.good.pick.count_table, fasta=test.trim.contigs.good.unique.pick.fasta)


Start End NBases Ambigs Polymer NumSeqs Minimum: 1 308 308 0 3 1 2.5%-tile: 1 308 308 0 4 8027 25%-tile: 1 316 316 0 5 80270 Median: 1 322 322 0 6 160539 75%-tile: 1 323 323 0 6 240808 97.5%-tile: 1 370 370 0 7 313051 Maximum: 1 442 442 0 8 321077 Mean: 1 322.905 322.905 0 5.56995 # of unique seqs: 251368 total # of seqs: 321077
count.groups(count=test.trim.contigs.good.pick.count_table) 891345 contains 174348. 891355 contains 146729.

Total seqs: 321077.

unique.seqs(count=test.trim.contigs.good.pick.count_table, fasta=test.trim.contigs.good.unique.pick.fasta)
No new uniques but carry on with re-uniqued files
251368 251368

Output File Names:
test.trim.contigs.good.unique.pick.count_table
test.trim.contigs.good.unique.pick.unique.fasta

summary.seqs(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 308 308 0 3 1
2.5%-tile: 1 308 308 0 4 8027
25%-tile: 1 316 316 0 5 80270
Median: 1 322 322 0 6 160539
75%-tile: 1 323 323 0 6 240808
97.5%-tile: 1 370 370 0 7 313051
Maximum: 1 442 442 0 8 321077
Mean: 1 322.905 322.905 0 5.56995

of unique seqs: 251368

total # of seqs: 321077

input for split and merge test


split.groups(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta)

test.trim.contigs.good.unique.pick.unique.891345.fasta
test.trim.contigs.good.unique.pick.891345.count_table
test.trim.contigs.good.unique.pick.unique.891355.fasta
test.trim.contigs.good.unique.pick.891355.count_table

unique.seqs(count=test.trim.contigs.good.unique.pick.891345.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891345.fasta)
No new uniques
140645 140645

Output File Names:
test.trim.contigs.good.unique.pick.unique.891345.count_table
test.trim.contigs.good.unique.pick.unique.891345.unique.fasta
summary.seqs(count=test.trim.contigs.good.unique.pick.unique.891345.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891345.unique.fasta)
#good

unique.seqs(count=test.trim.contigs.good.unique.pick.891355.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891355.fasta)
No new uniques
114485 114485

Output File Names:
test.trim.contigs.good.unique.pick.unique.891355.count_table
test.trim.contigs.good.unique.pick.unique.891355.unique.fasta

summary.seqs(count=test.trim.contigs.good.unique.pick.unique.891355.count_table, fasta=test.trim.contigs.good.unique.pick.unique.891355.unique.fasta)
#good

merge.count(count=test.trim.contigs.good.unique.pick.unique.891355.count_table-test.trim.contigs.good.unique.pick.unique.891345.count_table, output=test.merge.count_table)

[ERROR]: Your count table contains more than 1 sequence named D00259_626_H2KTHBCX2_2_1101_10097_41217, sequence names must be unique. Please correct.


alternative way to split groups

get.groups(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta, groups=891355)

test.trim.contigs.good.unique.pick.unique.pick.fasta
test.trim.contigs.good.unique.pick.pick.count_table

system(mv test.trim.contigs.good.unique.pick.unique.pick.fasta test.get.891355.fasta)
system(mv test.trim.contigs.good.unique.pick.pick.count_table test.get.891355.count_table)

get.groups(count=test.trim.contigs.good.unique.pick.count_table, fasta=test.trim.contigs.good.unique.pick.unique.fasta, groups=891345)

system(mv test.trim.contigs.good.unique.pick.unique.pick.fasta test.get.891345.fasta)
system(mv test.trim.contigs.good.unique.pick.pick.count_table test.get.891345.count_table)

merge.count(count=test.get.891345.count_table-test.get.891355.count_table, output=test.merge2.count_table)
[ERROR]: Your count table contains more than 1 sequence named D00259_626_H2KTHBCX2_2_1101_10097_41217, sequence names must be unique. Please correct.


#alternative make.contigs from fastq.info output

891345_lib232126_R2.qual
891345_lib232126_R2.fasta
891345_lib232126_R1.qual
891345_lib232126_R1.fasta

891355_lib232093_R2.qual
891355_lib232093_R2.fasta
891355_lib232093_R1.qual
891355_lib232093_R1.fasta

make.contigs(ffasta=891345_lib232126_R1.fasta, rfasta=891345_lib232126_R2.fasta, fqfile=891345_lib232126_R1.qual, rqfile=891345_lib232126_R2.qual)

Output File Names:
891345_lib232126_R1.trim.contigs.fasta
891345_lib232126_R1.scrap.contigs.fasta
891345_lib232126_R1.trim.contigs.qual
891345_lib232126_R1.scrap.contigs.qual
891345_lib232126_R1.contigs.report

make.contigs(ffasta=891355_lib232093_R1.fasta, rfasta=891355_lib232093_R2.fasta, fqfile=891355_lib232093_R1.qual, rqfile=891355_lib232093_R2.qual, processors=56)

Output File Names:
891355_lib232093_R1.trim.contigs.fasta
891355_lib232093_R1.scrap.contigs.fasta
891355_lib232093_R1.trim.contigs.qual
891355_lib232093_R1.scrap.contigs.qual
891355_lib232093_R1.contigs.report

make.group(fasta=891355_lib232093_R1.trim.contigs.fasta-891345_lib232126_R1.trim.contigs.fasta, groups=891355-891345)
Output File Names: 891355_lib232093_R1.trim.contigs.891345_lib232126_R1.trim.contigs.groups

system(mv 891355_lib232093_R1.trim.contigs.891345_lib232126_R1.trim.contigs.groups test2.contigs.groups)

merge.files(input=891345_lib232126_R1.trim.contigs.fasta-891355_lib232093_R1.trim.contigs.fasta, output=test2.trim.contigs.fasta)

summary.seqs(fasta=test2.trim.contigs.fasta, processors=56)
screen.seqs(fasta=test2.trim.contigs.fasta, maxambig=0, maxhomop=8, minlength=308, maxlength=445, processors=32, group=test2.contigs.groups)
summary.seqs(fasta=test2.trim.contigs.good.fasta)
unique.seqs(fasta=test2.trim.contigs.good.fasta)
count.seqs(name=test2.trim.contigs.good.names, group=test2.contigs.good.groups)
summary.seqs(count=test2.trim.contigs.good.count_table)
chimera.vsearch(fasta=test2.trim.contigs.good.unique.fasta, count=test2.trim.contigs.good.count_table)
remove.seqs(fasta=test2.trim.contigs.good.unique.fasta, count=test2.trim.contigs.good.count_table, accnos=test2.trim.contigs.good.unique.denovo.vsearch.accnos)
summary.seqs(count=test2.trim.contigs.good.pick.count_table, fasta=test2.trim.contigs.good.unique.pick.fasta)


Start End NBases Ambigs Polymer NumSeqs Minimum: 1 308 308 0 3 1 2.5%-tile: 1 308 308 0 4 8028 25%-tile: 1 316 316 0 5 80274 Median: 1 322 322 0 6 160547 75%-tile: 1 323 323 0 6 240820 97.5%-tile: 1 370 370 0 7 313065 Maximum: 1 442 442 0 8 321092 Mean: 1 322.904 322.904 0 5.56999 # of unique seqs: 251374 total # of seqs: 321092

unique.seqs(count=test2.trim.contigs.good.pick.count_table, fasta=test2.trim.contigs.good.unique.pick.fasta)
No new uniques

unique.seqs(count=test2.trim.contigs.good.unique.pick.891345.count_table, fasta=test2.trim.contigs.good.unique.pick.unique.891345.fasta)
unique.seqs(count=test2.trim.contigs.good.unique.pick.891355.count_table, fasta=test2.trim.contigs.good.unique.pick.unique.891355.fasta)

merge.count(count=test2.trim.contigs.good.unique.pick.unique.891355.count_table-test2.trim.contigs.good.unique.pick.unique.891345.count_table, output=test2.merge.count_table)
[ERROR]: Your count table contains more than 1 sequence named D00259_626_H2KTHBCX2_2_1101_10010_83780, sequence names must be unique. Please correct.


#2

Let me give you an example to explain what is going on.

Fasta file:

seq1
attgc…
seq2
ggcta…

Count file:
Representative_Sequence total A B C
seq1 3 1 2 0
seq2 10 1 4 5

The split.groups command would result in 6 files.

Fasta and count files for sample A, B would contain seq1 and seq2, where as the files for sample C would only contain seq2.

NOTE: Split.groups does NOT change any sequence names. This is done so that you can easily relate the reads in the individual samples to the original reads in the complete dataset. When you try to merge the files, mothur is finding seq1 in all 3 samples and complaining because it’s not sure which fasta and count information you want to preserve for seq1.

A workaround (will not preserve sequence names):

mothur > rename.seqs(fasta=sampleA.fasta, count=sampleA.count_table)
mothur > rename.seqs(fasta=sampleB.fasta, count=sampleB.count_table)
mothur > rename.seqs(fasta=sampleC.fasta, count=sampleC.count_table)
mothur > merge.files(input=sampleA.fasta-sampleB.fasta-sampleC.fasta, output=merged.fasta)
mothur > merge.count(count=sampleA.count_table-sampleB.count_table-sampleC.count_table, output=merged.count_table)
mothur > unique.seqs(fasta=merged.fasta, count=merged.count_table)