cluster list file created with replaced symbols in seq names

jenz · April 29, 2012, 5:09pm

Hi,
Recently I stumbled upon an issue: cluster creates list files in which random characters inside sequence names are sometimes replaced with undecipherable unicode characters (boxes filled with numbers; most text editors either don’t recognize them or show them as question marks “?” or box fill characters).
I discovered this when the next step, the make.shared command, printed an error and then mothur terminated.
Error: Sequence ‘DBNW5DQ1:135:C092BACXX:8:2205:12861:16187:1:N:0:CGATGT_79bp_79.0_0.82_4B691’ was not found in the group file, please correct.
I am trying to replace the sequence in the list file with the correct one, ‘DBNW5DQ1:135:C092BACXX:8:2205:12861:116187:1:N:0:CGATGT_79bp_79.0_0.82_4B691’, by using a Perl script. I’ll report back on whether that helped or whether there are other problems (such as another altered sequence name further on that causes mothur to crash).
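
For anyone hitting the same error, here is a minimal sketch of that kind of repair, written in Python rather than the Perl script mentioned above. It assumes a corrupted name has the same length as the original and differs from exactly one name in the group file by a single character. The function names `hamming_one` and `repair_names` are hypothetical, not part of mothur.

```python
def hamming_one(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def repair_names(otu_names, valid_names):
    """Map each unknown name to the unique valid name one character away.

    otu_names:   sequence names read from the list file.
    valid_names: set of names taken from the group file (assumed correct).
    Names with zero or several candidates are left unchanged.
    """
    repaired = []
    for name in otu_names:
        if name not in valid_names:
            candidates = [v for v in valid_names if hamming_one(name, v)]
            if len(candidates) == 1:
                name = candidates[0]
        repaired.append(name)
    return repaired
```

This compares every unknown name against every valid name, so it is only practical when few names are corrupted. Names that cannot be matched unambiguously are passed through so they can be inspected by hand.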

westcott · May 1, 2012, 1:12pm

Could you list the commands you ran before this as well as your version of mothur and OS?

jenz · May 1, 2012, 1:48pm

trim.seqs(fasta=stoma.fa,qfile=stoma.quala, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=25, processors=7)
system(./grupp_stoma.pl stoma.trim.fasta > stoma.groups)
unique.seqs(fasta=stoma.trim.fasta)
align.seqs(fasta=stoma.trim.unique.fasta, reference=silva.bacteria.fasta, processors=7,flip=T)
summary.seqs(fasta=stoma.trim.unique.align)
screen.seqs(fasta=stoma.trim.unique.align, name=stoma.trim.names, group=stoma.groups, end=34113,start=31189,minlength=65,processors=7)
summary.seqs(fasta=current)
filter.seqs(fasta=stoma.trim.unique.good.align, vertical=T, trump=., processors=7)
unique.seqs(fasta=stoma.trim.unique.good.filter.fasta, name=stoma.trim.good.names)
pre.cluster(fasta=stoma.trim.unique.good.filter.unique.fasta, name=stoma.trim.unique.good.filter.names, group=stoma.good.groups, diffs=1)
chimera.uchime(fasta=stoma.trim.unique.good.filter.unique.precluster.fasta, name=stoma.trim.unique.good.filter.unique.precluster.names, group=stoma.good.groups, processors=7)
remove.seqs(accnos=stoma.trim.unique.good.filter.unique.precluster.uchime.accnos, fasta=stoma.trim.unique.good.filter.unique.precluster.fasta, name=stoma.trim.unique.good.filter.unique.precluster.names, group=stoma.good.groups)
system(mv stoma.trim.unique.good.filter.unique.precluster.pick.names stoma.final.names)
system(mv stoma.trim.unique.good.filter.unique.precluster.pick.fasta stoma.final.fasta)
system(mv stoma.good.pick.groups stoma.final.groups)
remove.groups(fasta=stoma.final.fasta, name=stoma.final.names, group=stoma.final.groups, groups=1B344B2-1B421B2-1B536B2-2B164B2-2B294B2-2B540B2-3B172B2-3B369B2-3B496B2-4B647B2-4B714B2-4B771B2)
classify.seqs(fasta=stoma.final.pick.fasta,template=gg_99.pds.ng.fasta, taxonomy=gg_99.pds.tax, cutoff=80, processors=7)
remove.lineage(taxonomy=stoma.final.pick.pds.taxonomy, name=stoma.final.pick.names, group=stoma.proovikaupa.pick.groups, fasta=stoma.final.pick.fasta, taxon=k__Archaea;unclassified;-k__Bacteria;unclassified-Root;unclassified;, dups=T)
system(mv stoma.final.pick.pick.names stomaP.final.names)
system(mv stoma.final.pick.pick.fasta stomaP.final.fasta)
system(mv stoma.proovikaupa.pick.pick.groups stomaP.final.groups)
system(mv stoma.final.pick.pds.pick.taxonomy stomaP.final.taxonomy)
system(./kohtadeks.pl stomaP.final.groups > stomaP.kohtadekaupa.groups)
dist.seqs(fasta=stomaP.final.fasta, cutoff=0.10, processors=7)
cluster(column=stomaP.final.dist, name=stomaP.final.names,method=furthest, cutoff=0.05)
make.shared(list=stomaP.final.fn.list, group=stomaP.kohtadekaupa.groups, label=0.05)
The first script creates the group file by taking seqnames like >jadajada_4B562 and writing out jadajada/t4B562; the second one takes the group file and modifies it from
jadajada4B562/t4B562 to jadajada4B562/t4. (The 4 before B denotes a subject, the numbers after it a sample.)
My mothur version is mothur v.1.23.0, Last updated: 1/9/2012, 64 bit version. Ran on Ubuntu 11.04 - the Natty Narwhal.
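
The two helper scripts are not shown in the thread, so the following is only a rough Python sketch of what they appear to do. It assumes that "/t" above means a tab, that the full sequence name is kept in the first column (as a mothur group file expects), and that the group is whatever follows the last underscore. The names `group_file_line` and `by_subject` are hypothetical.

```python
def group_file_line(fasta_header):
    """Build one group-file line from a FASTA header like '>jadajada_4B562'.

    Assumption: the group label is the text after the last underscore.
    """
    name = fasta_header.lstrip(">").strip()
    group = name.rsplit("_", 1)[1]
    return f"{name}\t{group}"


def by_subject(group_line):
    """Collapse a sample label such as '4B562' to its subject, '4'.

    Assumption: the subject is everything before the first 'B'.
    """
    name, group = group_line.rstrip("\n").split("\t")
    subject = group.split("B", 1)[0]
    return f"{name}\t{subject}"
```

A header without an underscore would raise an IndexError here, which is probably what you want, since such a sequence could not be assigned to a group anyway.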

westcott · May 1, 2012, 2:14pm

Do the names look okay in the dist and name files that go into cluster? We have had a few bug reports from Linux users that seem similar to yours, but in those the problem occurs in dist.seqs with multiple processors. Also, do you know which version of g++ you are using? g++ --version should tell you.

jenz · May 1, 2012, 3:12pm

Unfortunately I have already mucked around in many of my files and haven’t kept a good record.
The problem may be in the dist step. (I had a similar problem once before, when cluster (the next step after dist) complained about sequence names being out of sync; I made a short post about it in the “multiple processors” thread.)
I will try to run my pipeline again in a few days if possible and keep a good record up to the point where the problem appears.

jenz · May 1, 2012, 4:36pm

I took a look, and it seems that the replacements start in the list file. Because the files are large, I cannot search them other than with Perl scripts.
I used a simple script to look into the dist and list files.
while (my $line = <>) {
    # \x11 is assumed to be the character that shows up as a box containing 0011
    if ($line =~ m/\x11/) { print "+"; }
}
It reported replacements in the list file but not in the dist files. I was able to continue with my analyses by running the list file through a script that replaced each instance of the box character with 1 (I found that a 1 was in that position in the original name). It seems to replace random ones with a box symbol filled with 0011 (left to right, top to bottom).

Also, my g++ version is
gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
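
A slightly more general check than the one-character search above would be to tally every control character in a file by code point. Here is a minimal Python sketch, assuming that tabs and line endings are the only control characters a list or dist file should contain. The names `control_chars` and `scan` are hypothetical.

```python
from collections import Counter


def control_chars(line):
    """Return (column, code point) for each control character except tab/newline."""
    return [(i, ord(c)) for i, c in enumerate(line)
            if ord(c) < 32 and c not in "\t\n\r"]


def scan(lines):
    """Count control characters by code point across an iterable of lines."""
    counts = Counter()
    for line in lines:
        for _, code in control_chars(line):
            counts[code] += 1
    return counts
```

It could be run over the list file with something like `scan(open("stomaP.final.fn.list", encoding="latin-1"))`; latin-1 is used only because it decodes any byte without raising an error. A box filled with 0011 would be counted under code point 17.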

westcott · May 1, 2012, 5:42pm

The names in the name file are okay, right? Also, would it be possible for you to send a small sample of your seqs so we can try to reproduce and resolve the issue? mothur.bugs@gmail.com
