.dist file with more than 3 columns

Lucas · September 11, 2013, 10:44am

Hi all,

I ran into a problem while processing my data that I think is a bug in the dist.seqs() step.

While running the command cluster(column=mysesqs.trim.unique.good.filter.unique.pick.dist, count=mysesqs.trim.unique.good.filter.uchime.pick.count_table, cutoff=0.2), the error message output was: [ERROR]: SH33SED.10SS.3.1.1019571_167878 is not in your count table. Please correct.

The line in the mysesqs.trim.unique.good.filter.unique.pick.dist file that contains the problem ID has 4 columns instead of 3: “SS.3.1.1019571_167878 SH33SED.10SS.3.1.1019571_167878 SH33SED.1020220_67075 0.1919”.
The extra column is formed by 2 IDs that somehow got concatenated.
Searching my dist file for lines with more than 3 columns, I found 12 such lines, normally with a concatenated ID in between two normal IDs.
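The search described above can be sketched in a few lines of Python. This is only an illustration, assuming a whitespace-separated column-format file with exactly 3 fields per line (two sequence IDs and a distance); the function name find_malformed_lines and the first and third sample rows are made up for the example, while the second sample row is the bad line quoted above.

```python
def find_malformed_lines(lines, expected_columns=3):
    """Return (line_number, fields) for every non-empty line whose
    number of whitespace-separated fields differs from expected_columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and len(fields) != expected_columns:
            bad.append((number, fields))
    return bad


# Hypothetical sample: lines 1 and 3 are invented well-formed rows,
# line 2 is the malformed row reported in this thread.
sample = [
    "seqA seqB 0.0123",
    "SS.3.1.1019571_167878 SH33SED.10SS.3.1.1019571_167878 SH33SED.1020220_67075 0.1919",
    "seqA seqC 0.0456",
]

for number, fields in find_malformed_lines(sample):
    print(f"line {number}: {len(fields)} columns")
```

To scan a real file, pass an open file handle instead of the list, for example: with open(path) as handle: find_malformed_lines(handle). The file is read line by line, so a very large dist file does not need to fit in memory.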
It is ok for me to check and correct them manually, but I would like to let you guys know about the problem.
The command I ran to generate the dist file was:
dist.seqs(fasta=mysesqs.trim.unique.good.filter.unique.pick.fasta, cutoff=0.2, processors=32)

Maybe it was a problem while writing the file? I was using the full CPU capacity of my computer.
Cheers,

westcott · September 11, 2013, 12:25pm

Hmmm… that is odd. With multiple processors, mothur divides the distance calculations between the processors and then appends the resulting files. It looks like an error in the appending step. I am not able to reproduce the error on our test machines. Can you try with fewer processors?

Lucas · September 11, 2013, 12:55pm

Hi,

Yes, I am trying right now. It is running with 30 processors. Curiously, it is no longer using 100% of the capacity of each processor.
I will let you know whether this solves the problem.
Thanks,
Lucas

Lucas · September 16, 2013, 8:33am

Hi,

I solved the problem. It was my mistake. The cut-off of the dist.seqs() step was too high, and my file was huge. It filled up my hard drive, so the concatenated lines were probably caused by the program trying to write to a disk that had run out of space.
I lowered the cut-off down to 0.05 and it was all solved.
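For anyone who wants to check for this before a long run, here is a rough sketch in Python. It relies on assumptions that are not from mothur itself: the names max_dist_file_bytes and has_enough_space are made up, and bytes_per_line=60 is only a guess at the average length of one "id id distance" line. The estimate is a worst case, n*(n-1)/2 pairs for n sequences; a cutoff normally keeps far fewer pairs, so the real file should be smaller.

```python
import shutil


def max_dist_file_bytes(num_sequences, bytes_per_line=60):
    """Worst-case size of a column-format distance file: one line per
    unordered pair of sequences. bytes_per_line is an assumed average."""
    pairs = num_sequences * (num_sequences - 1) // 2
    return pairs * bytes_per_line


def has_enough_space(directory, required_bytes):
    """True if the filesystem holding `directory` has at least
    required_bytes free, as reported by shutil.disk_usage."""
    return shutil.disk_usage(directory).free >= required_bytes


# Hypothetical example: 100,000 unique sequences, output written to the
# current directory.
estimate = max_dist_file_bytes(100_000)
print(f"worst-case dist file size: {estimate / 1e9:.1f} GB")
print("enough free space:", has_enough_space(".", estimate))
```

If the check fails, lowering the cutoff (as I did) or writing to a larger disk are the obvious options.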
Thanks for the help
Lucas
