Hi,
I recently ran into an issue: cluster creates list files in which random characters inside sequence names are sometimes replaced with undecipherable Unicode characters. They show up as boxes filled with numbers; most text editors either don’t recognize them or display them as question marks (“?”) or box fill characters.
I discovered this when the next step, make.shared, printed an error and mothur terminated:
Error: Sequence ‘DBNW5DQ1:135:C092BACXX:8:2205:12861:16187:1:N:0:CGATGT_79bp_79.0_0.82_4B691’ was not found in the group file, please correct.
I am trying to replace the sequence name in the list file with the correct one, ‘DBNW5DQ1:135:C092BACXX:8:2205:12861:116187:1:N:0:CGATGT_79bp_79.0_0.82_4B691’, using a Perl script. I’ll report back on whether that helped or whether other problems turn up (such as another altered sequence name that causes mothur to crash).
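For anyone hitting the same error, a quick way to see how many names are affected is to compare the list file against the group file before running make.shared. The sketch below is only an illustration in Python, not part of mothur: it assumes the list file has one tab-separated line per label (label, number of OTUs, then one column per OTU with comma-separated sequence names) and that the group file has the sequence name in its first tab-separated column. The function names are made up for this example.

```python
def names_in_list_line(line):
    """Return the set of sequence names found in one list-file line."""
    fields = line.rstrip("\r\n").split("\t")
    names = set()
    # Assumed layout: fields[0] = label, fields[1] = OTU count, rest = OTUs.
    for otu in fields[2:]:
        names.update(name for name in otu.split(",") if name)
    return names


def missing_from_groups(list_lines, group_lines, label=None):
    """Return the sorted names that are in the list file but not in the group file."""
    known = {
        line.rstrip("\r\n").split("\t", 1)[0]
        for line in group_lines
        if line.rstrip("\r\n")
    }
    missing = set()
    for line in list_lines:
        # Optionally restrict the check to a single label such as "0.05".
        if label is not None and not line.startswith(label + "\t"):
            continue
        missing |= names_in_list_line(line) - known
    return sorted(missing)
```

Both arguments can be open file handles, so the files are read line by line; printing the result with repr() makes any control characters in the names visible.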
Could you list the commands you ran before this as well as your version of mothur and OS?
trim.seqs(fasta=stoma.fa,qfile=stoma.quala, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=25, processors=7)
system(./grupp_stoma.pl stoma.trim.fasta > stoma.groups)
unique.seqs(fasta=stoma.trim.fasta)
align.seqs(fasta=stoma.trim.unique.fasta, reference=silva.bacteria.fasta, processors=7,flip=T)
summary.seqs(fasta=stoma.trim.unique.align)
screen.seqs(fasta=stoma.trim.unique.align, name=stoma.trim.names, group=stoma.groups, end=34113,start=31189,minlength=65,processors=7)
summary.seqs(fasta=current)
filter.seqs(fasta=stoma.trim.unique.good.align, vertical=T, trump=., processors=7)
unique.seqs(fasta=stoma.trim.unique.good.filter.fasta, name=stoma.trim.good.names)
pre.cluster(fasta=stoma.trim.unique.good.filter.unique.fasta, name=stoma.trim.unique.good.filter.names, group=stoma.good.groups, diffs=1)
chimera.uchime(fasta=stoma.trim.unique.good.filter.unique.precluster.fasta, name=stoma.trim.unique.good.filter.unique.precluster.names, group=stoma.good.groups, processors=7)
remove.seqs(accnos=stoma.trim.unique.good.filter.unique.precluster.uchime.accnos, fasta=stoma.trim.unique.good.filter.unique.precluster.fasta, name=stoma.trim.unique.good.filter.unique.precluster.names, group=stoma.good.groups)
system(mv stoma.trim.unique.good.filter.unique.precluster.pick.names stoma.final.names)
system(mv stoma.trim.unique.good.filter.unique.precluster.pick.fasta stoma.final.fasta)
system(mv stoma.good.pick.groups stoma.final.groups)
remove.groups(fasta=stoma.final.fasta, name=stoma.final.names, group=stoma.final.groups, groups=1B344B2-1B421B2-1B536B2-2B164B2-2B294B2-2B540B2-3B172B2-3B369B2-3B496B2-4B647B2-4B714B2-4B771B2)
classify.seqs(fasta=stoma.final.pick.fasta,template=gg_99.pds.ng.fasta, taxonomy=gg_99.pds.tax, cutoff=80, processors=7)
remove.lineage(taxonomy=stoma.final.pick.pds.taxonomy, name=stoma.final.pick.names, group=stoma.proovikaupa.pick.groups, fasta=stoma.final.pick.fasta, taxon=k__Archaea;unclassified;-k__Bacteria;unclassified-Root;unclassified;, dups=T)
system(mv stoma.final.pick.pick.names stomaP.final.names)
system(mv stoma.final.pick.pick.fasta stomaP.final.fasta)
system(mv stoma.proovikaupa.pick.pick.groups stomaP.final.groups)
system(mv stoma.final.pick.pds.pick.taxonomy stomaP.final.taxonomy)
system(./kohtadeks.pl stomaP.final.groups > stomaP.kohtadekaupa.groups)
dist.seqs(fasta=stomaP.final.fasta, cutoff=0.10, processors=7)
cluster(column=stomaP.final.dist, name=stomaP.final.names,method=furthest, cutoff=0.05)
make.shared(list=stomaP.final.fn.list, group=stomaP.kohtadekaupa.groups, label=0.05)
The first script creates the group file by taking sequence names like >jadajada_4B562 and writing out jadajada/t4B562 (where /t stands for a tab); the second one takes the group file and changes each line from
jadajada4B562/t4B562 to jadajada4B562/t4 (the 4 before the B denotes the subject; the number after it denotes the sample).
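The two Perl scripts themselves were not posted, so the following Python sketch is only a guess at what they do, based on the description above. It assumes that the full FASTA header is kept as the sequence name, that the group label is whatever follows the last underscore, and that the subject number is the run of digits at the start of the group label. The function names are invented for illustration.

```python
import re


def group_lines_from_fasta(fasta_lines):
    """Yield 'name<TAB>group' for every FASTA header line."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue  # skip a bare '>' with no name
        name = fields[0]
        # Assumption: the group is the text after the last underscore.
        group = name.rsplit("_", 1)[-1]
        yield f"{name}\t{group}"


def collapse_to_subject(group_line):
    """Shorten the group column to the leading subject number (4B562 -> 4)."""
    name, group = group_line.rstrip("\r\n").split("\t")
    match = re.match(r"\d+", group)
    subject = match.group(0) if match else group
    return f"{name}\t{subject}"
```

If the group label has no leading digits, the second function leaves it unchanged rather than guessing.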
My mothur version is mothur v.1.23.0, Last updated: 1/9/2012, 64 bit version. Ran on Ubuntu 11.04 - the Natty Narwhal.
Do the names look okay in the dist and name files that go into cluster? We have had a few bug reports from Linux users that seem similar to yours, but there the problem occurs in dist.seqs with multiple processors. Also, do you know which version of g++ you are using? g++ --version should tell you.
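With large files this check is easier to script than to do by eye. The following Python sketch is one possible way to do it and is not a mothur feature: it assumes a column-format distance file with two sequence names and a distance per line, separated by spaces or tabs, and a name file whose first tab-separated column holds the representative sequence name. The function names are hypothetical.

```python
import re


def representative_names(names_lines):
    """Collect the first column of the name file."""
    return {
        line.rstrip("\r\n").split("\t", 1)[0]
        for line in names_lines
        if line.rstrip("\r\n")
    }


def unknown_dist_names(dist_lines, names_lines):
    """Return sorted names used in the dist file that the name file does not list."""
    known = representative_names(names_lines)
    unknown = set()
    for line in dist_lines:
        # Split only on spaces and tabs so control characters stay inside names.
        fields = re.split(r"[ \t]+", line.rstrip("\r\n"))
        if len(fields) < 3:
            continue  # not a 'name name distance' line
        unknown.update(name for name in fields[:2] if name not in known)
    return sorted(unknown)
```

An empty result suggests the names were still intact when dist.seqs wrote its output; any name that comes back is worth inspecting with repr() for hidden characters.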
Unfortunately I have already mucked around in many of my files and didn’t keep a good record.
It might be that the problem is in the dist step. (I already had a similar problem once, when cluster, the next step after dist, complained about sequence names being out of sync. I made a short post about it in the “multiple processors” thread.)
I will try to run my pipeline again in a few days if possible and keep a good record up to the point where the problem appears.
I took a look and it seems that the replacements start in the list file. Because the files are large, I cannot search them any other way than with Perl scripts.
I used a simple script to look into the dist and list files.
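while (my $line = <>) {
    # The literal garbled character in the original pattern did not survive posting;
    # this class matches any byte outside printable ASCII and whitespace instead.
    if ($line =~ m/[^\x20-\x7E\s]/) { print "+"; }
}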
It reported replacements in the list file but not in the dist file. I was able to continue with my analyses by running the list file through a script that replaced each instance of the garbled character with 1 (I found out that a 1 belonged in this place in the original name). It seems to replace random ones with a box symbol filled with 0011 (read left to right, top to bottom).
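To find out where exactly the odd characters sit, a scan that reports byte offsets can be more useful than a simple yes/no per line. The Python sketch below is one way to do that under a few assumptions: sequence names are plain ASCII, so any control byte (other than tab, newline and carriage return) or any byte above 0x7E is treated as suspect. A box labelled 0011 is presumably how the editor renders the control character U+0011, which this pattern would catch. The function name and chunk size are arbitrary choices for this example.

```python
import re

# Control bytes other than tab, newline and carriage return, plus DEL and
# everything outside the ASCII range.
SUSPECT = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]")


def find_suspect_bytes(path, chunk_size=1 << 20):
    """Yield (byte_offset, byte_value) for every suspect byte in the file."""
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Each match is a single byte, so chunk boundaries cannot split one.
            for match in SUSPECT.finditer(chunk):
                yield offset + match.start(), match.group()[0]
            offset += len(chunk)
```

The file is read in fixed-size chunks, so memory use stays small even for very large list or dist files. Counting the distinct byte values that come back shows whether it is always the same character that replaces the original one.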
Also, my g++ version is
gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
The names in the name file are okay, right? Also, would it be possible for you to send a small sample of your seqs so we can try to reproduce and resolve the issue? mothur.bugs@gmail.com