Bug in Win_64 cmd version

When using the
Windows version, 64 Bit Version mothur v.1.32.1
Last updated: 1/6/2014
(yes I copied that from the logfile ;))

I have encountered two bugs:

First, the program reports that read names are missing from the names file when in fact they are there. I have not been able to figure out why the program cannot “see” them. It happens most often with names that are listed to the right, in the comma-delimited portion of the file. It gives me the following message:
[ERROR]: 07HYPI40V02DIQLX is in your fasta file, and not in your namefile, please correct.
Sometimes it does work (although it is reporting the error the entire time), but then the next time I have to use the file, it doesn’t work at all. At times this causes the program to crash.

Second, when generating the groups file, reads are being duplicated within the file. For example:
Your groupfile contains more than 1 sequence named 39ICYDE7004IR2IM, sequence names must be unique. Please correct.
Once the duplicated sequences are removed, the file is fine and can be used.

I downloaded the MothurGUI.win_64, and it can read the names files fine. So, it does not seem to be a problem with how the files are being written.

Thanks.

James

Can you post the command you ran the files with?

This occurs at any point where a names file should be included in the analysis (all steps that ask for the name file).

The duplicate values in the groups file occur each time the groups file is created.

Here is an instance where I get the message that the names file is a problem, but the analysis “seems” to work:

mothur > unique.seqs(fasta=allsffRPtrmd.trim.fasta, name=allsffRPtrmd.trim.names)
Unable to open C:\mothur\in\allsffRPtrmd.trim.fasta. Trying output directory C:\mothur\out\allsffRPtrmd.trim.fasta
Unable to open C:\mothur\in\allsffRPtrmd.trim.names. Trying output directory C:\mothur\out\allsffRPtrmd.trim.names
[ERROR]: 07HYPI40V02DIQLX is in your fasta file, and not in your namefile, please correct.
[ERROR]: 07HYPI40V02EM9O0 is in your fasta file, and not in your namefile, please correct.
. . . .
[ERROR]: 39ICYDE7004IHIUI is in your fasta file, and not in your namefile, please correct.
[ERROR]: 39ICYDE7004IR2IM is in your fasta file, and not in your namefile, please correct.
33139 24151

Output File Names:
C:\mothur\out\allsffRPtrmd.trim.unique.names
C:\mothur\out\allsffRPtrmd.trim.unique.fasta

But the resulting file is problematic.
mothur > summary.seqs(name=current)
Using C:\mothur\out\allsffRPtrmd.trim.unique.names as input file for the name parameter.
Using C:\mothur\out\allsffRPtrmd.trim.unique.fasta as input file for the fasta parameter.

Using 1 processors.
[ERROR]: ‘07HYPI40V02DIQLX’ is not in your name or count file, please correct.
Note: But it is there . . . this causes the summary.seqs to stop processing the file.


Another attempt:

Output File Names:
C:\mothur\out\allsffRPtrmd.trim.fasta
C:\mothur\out\allsffRPtrmd.scrap.fasta
C:\mothur\out\allsffRPtrmd.trim.qual
C:\mothur\out\allsffRPtrmd.scrap.qual
C:\mothur\out\allsffRPtrmd.trim.names
C:\mothur\out\allsffRPtrmd.scrap.names
C:\mothur\out\allsffRPtrmd.groups


mothur > summary.seqs(fasta=allsffRPtrmd.trim.fasta, name=allsffRPtrmd.trim.names)
Unable to open C:\mothur\in\allsffRPtrmd.trim.fasta. Trying output directory C:\mothur\out\allsffRPtrmd.trim.fasta
Unable to open C:\mothur\in\allsffRptrmd.trim.names. Trying output directory C:\mothur\out\allsffRptrmd.trim.names

Using 1 processors.
[ERROR]: '07HYPI40V02DIQLX' is not in your name or count file, please correct.

But again, here is an instance where it seems to work
mothur > summary.seqs(name=current)
Using C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.names as input file for the name parameter.
Using C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.fasta as input file for the fasta parameter.

Using 2 processors.

             Start  End      NBases   Ambigs     Polymer  NumSeqs
Minimum:     1      337      96       0          2        1
2.5%-tile:   1      340      116      0          3        819
25%-tile:    1      340      123      0          4        8190
Median:      1      340      123      0          4        16379
75%-tile:    1      340      133      0          4        24568
97.5%-tile:  1      340      143      0          5        31939
Maximum:     1      340      171      4          7        32757
Mean:        1      339.999  126.713  0.0141954  3.95842

# of unique seqs: 7210

total # of seqs: 32757

Output File Names:
C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.summary

But immediately after doing that, there are issues with the file. And, of course, there is the error regarding the duplicate values in the groups file, which I had not corrected.
mothur > pre.cluster(fasta=current, name=current, group=allsffRPtrmd.pick.good.groups, diffs=2)
Using C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.fasta as input file for the fasta parameter.
Using C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.names as input file for the name parameter.
Unable to open C:\mothur\in\allsffRPtrmd.pick.good.groups. Trying output directory C:\mothur\out\allsffRPtrmd.pick.good.groups

Using 2 processors.
Your groupfile contains more than 1 sequence named 07HYPI40V02DIQLX, sequence names must be unique. Please correct.
Your groupfile contains more than 1 sequence named 07HYPI40V02EM9O0, sequence names must be unique. Please correct.
Your groupfile contains more than 1 sequence named 07HYPI40V02DH8EP, sequence names must be unique. Please correct.
. . .
. . .
Your groupfile contains more than 1 sequence named 39ICYDE7004IHIUI, sequence names must be unique. Please correct.
Your groupfile contains more than 1 sequence named 39ICYDE7004IR2IM, sequence names must be unique. Please correct.

[ERROR]: Your name file contains 0 valid sequences, and your groupfile contains 32801, please correct.
[ERROR]: process 0 only processed 1 of 6 groups assigned to it, quitting.

/******************************************/
Running command: unique.seqs(fasta=C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.precluster.fasta, name=C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.precluster.names)
[ERROR]: C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.precluster.fasta is blank, aborting.
Using C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.fasta as input file for the fasta parameter.
[ERROR]: C:\mothur\out\allsffRPtrmd.trim.unique.pick.good.filter.unique.precluster.names is blank, aborting.
/******************************************/


At this point, the program crashes.

I have noticed that there was an issue with the names files once before, and it seemed to be associated with an external program that was modifying the files in some way. It would be nice if information on which program it was, and what modification it made to the names files, were posted to the forum. That way, I could check whether this is also the issue here. What I have tried is changing the line endings between the various formats: Unix, MacOSX, Windows (as this is commonly an issue that can affect a program's ability to use a file). This did not remedy the problem.
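
For anyone wanting to rule out line endings before converting files back and forth, here is a minimal Python sketch that just counts which endings a file contains. It is not part of mothur; the function name and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows), lone LF (Unix) and lone CR (older Mac style) endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,   # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,   # CR not followed by LF
    }

# Made-up sample standing in for the raw bytes of a names file.
sample = b"readA\treadA,readB\r\nreadC\treadC\nreadD\treadD\r"
print(count_line_endings(sample))
```

For a real file, the bytes could be read with open(path, "rb").read(), where path is whatever file is being checked. A file with more than one non-zero count has mixed line endings.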

Why the groups files generated contain duplicate values, I cannot say.

Hope this helps.

James

Also, for the “duplications” occurring in the groups files:

I extracted the reads directly from the sff files. I have checked the original fasta file used in the analysis. None of the reads are duplicated; all of the headers contain unique names. When eliminating the duplicates in the groups file, there are usually around 4800 duplicate names (if I am remembering correctly). I have not tried to determine whether all of the duplicated reads in the GROUPS file are identical to the reads that are giving me problems in the NAMES files. However, there is a preponderance of reads in common that are being flagged in both the NAMES and GROUPS files.

So, I strongly suspect that the duplicated names in the GROUPS files are also the ones that cannot be “found” in the NAMES files.
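
That suspicion could be tested with something like the Python sketch below, which lists the names that occur more than once in a groups file and intersects them with the names that sit only in the comma-delimited column of a names file. It assumes whitespace-separated columns in both files; the function names and the tiny example data are hypothetical, not taken from the real files.

```python
from collections import Counter

def duplicated_group_names(group_lines):
    """Return read names listed more than once in a groups file.

    Each non-blank line is assumed to be: read name, whitespace, group name.
    """
    counts = Counter(line.split()[0] for line in group_lines if line.strip())
    return {name for name, n in counts.items() if n > 1}

def redundant_names(name_lines):
    """Return second-column names that are not the representative of their own line."""
    redundant = set()
    for line in name_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        representative, members = fields[0], fields[1].split(",")
        redundant.update(m for m in members if m != representative)
    return redundant

# Made-up miniature files, for illustration only.
groups = ["readA\tg1", "readB\tg1", "readB\tg1", "readC\tg2"]
names = ["readA\treadA,readB", "readC\treadC"]

dups = duplicated_group_names(groups)
print(sorted(dups))                           # names duplicated in the groups file
print(sorted(dups & redundant_names(names)))  # duplicates that are also "redundant" reads
```

If the second list is as long as the first, every duplicated name in the groups file is also one of the reads listed only in the second column of the names file.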

James

I suspect all the issues are stemming from the original unique.seqs issue. Is 07HYPI40V02DIQLX in the first column of the names file? Mothur expects all sequences in the fasta file to be in the first column of the names file. This is because mothur assumes they are the unique, or representative, sequences for the sequences in column 2.
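
As a rough way to check this outside of mothur, the Python sketch below sorts every fasta header into one of three bins: present in the first column of the names file, present only in the second column, or missing altogether. This is only an illustration of the expectation described above, not mothur's own code; the function names and sample data are invented.

```python
def fasta_ids(fasta_lines):
    """Collect sequence names from FASTA headers (text after '>' up to the first whitespace)."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids

def check_fasta_against_names(fasta_lines, name_lines):
    """Classify each fasta name by where it appears in the names file."""
    first_col, second_col = set(), set()
    for line in name_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        first_col.add(fields[0])
        second_col.update(fields[1].split(","))

    report = {"ok": [], "only_in_second_column": [], "missing": []}
    for seq_id in fasta_ids(fasta_lines):
        if seq_id in first_col:
            report["ok"].append(seq_id)
        elif seq_id in second_col:
            report["only_in_second_column"].append(seq_id)
        else:
            report["missing"].append(seq_id)
    return report

# Made-up example: a fasta that was NOT dereplicated, paired with a dereplicated names file.
fasta = [">readA", "ACGT", ">readB", "ACGT"]
names = ["readA\treadA,readB"]
print(check_fasta_against_names(fasta, names))
```

A large "only_in_second_column" bin would suggest that the fasta file and the names file come from different steps of the analysis, for example a pre-unique.seqs fasta paired with a post-unique.seqs names file.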

No, that particular sequence, and all of the others that I have checked, are in the comma-delimited portion of the file. They are not in the first column of the names file. So those are not the “unique” sequences, but the “duplicates” identified by the unique.seqs command.

James

Okay, that is the issue. How did you create this names file?

The original NAMES file was created using the unique.seqs command on the fasta file generated from the trim.seqs command.

Unfortunately, I am at another location and cannot access those files generated from the Win_64 program at this moment.

I am using the same version of mothur on a 32-bit platform that I have access to right now.
The NAMES file appears the same in both instances. For example, any of the read names that appear in the second (comma-delimited) portion of the NAMES file generated using the Win_32 bit program do not appear anywhere else in the NAMES file (on an individual line). I have not had any issues with the Win_32 bit platform, but was conducting a different type of analysis. I can try replicating on the Win_32 bit platform the same analysis I was doing on the Win_64 bit platform, and see if I have the same difficulties.

In both instances, I am using Windows version, mothur v.1.32.1.

Otherwise, any further information regarding the Win_64 bit version of mothur will have to wait until I return home. :D.

James

Maybe I should clarify my last statement with an example from the NAMES file. In this case I am using the one generated on the WIN_32 bit version of mothur.

So, a portion of the NAMES file after running unique.seqs on the trim.seqs fasta file looks like this:

HYPI40V02DEC8K HYPI40V02DEC8K,HYPI40V02EV67D,HYPI40V02DTE0P,HYPI40V02EAIKU
HYPI40V02DQAGI HYPI40V02DQAGI,HYPI40V02DUKBE,HYPI40V02D1DKP
HYPI40V02D57QL HYPI40V02D57QL
HYPI40V02EBEO5 HYPI40V02EBEO5

The read names that the Win_64 bit version of the program is having difficulty finding are the ones that follow the first name in the comma-delimited column (for example, HYPI40V02EV67D and HYPI40V02DUKBE above). These read names are not appearing anywhere else in the file (they are not present in the first column).
In the groups file that is generated, these same names appear to be the ones that are duplicated. I say “appear to be” only because I have not determined if this applies to all of them, but that would make sense based on the ones I have searched for.

I hope this makes it more clear.

I think perhaps you may have some file overwriting going on, or you are using the wrong files together. Let me try to clear a few things up about what mothur is expecting. Mothur uses the names file to save processing time. It does this by assuming that the sequence in the first column represents all the sequences in the second column; for all intents and purposes, to mothur they are identical. Therefore, only one copy of the sequence is needed in the fasta file. For example, instead of aligning 100 identical sequences, we align 1, and when calculating distances only one calculation is needed. Mothur uses the names file to keep track of all the redundant sequences so no information is lost. Can you go back and try rerunning things from the trim.seqs command, and post the commands you are running as well as mothur’s outputs?
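
To make the first-column/second-column convention concrete, here is a small Python sketch of the dereplication idea. It is an illustration only and not mothur's implementation; the function name, the choice of the first-seen read as the representative, and the sample reads are all assumptions.

```python
def dereplicate(records):
    """Group identical sequences.

    records: list of (read name, sequence) pairs.
    Returns (unique_records, names_rows), where unique_records holds one
    (representative, sequence) pair per distinct sequence and names_rows holds
    (representative, comma-joined list of all reads with that sequence).
    """
    members_by_seq = {}
    for name, seq in records:
        members_by_seq.setdefault(seq, []).append(name)

    # Assumption for this sketch: the first read seen represents its group.
    unique_records = [(members[0], seq) for seq, members in members_by_seq.items()]
    names_rows = [(members[0], ",".join(members)) for members in members_by_seq.values()]
    return unique_records, names_rows

# Made-up reads: r1 and r2 are identical, r3 is different.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTTT")]
unique_records, names_rows = dereplicate(reads)

for rep, seq in unique_records:      # what would go in the dereplicated fasta
    print(f">{rep}\n{seq}")
for rep, members in names_rows:      # what would go in the names file
    print(f"{rep}\t{members}")
```

Here r2 appears only in the second column of the names rows and not in the dereplicated fasta, so pairing these names rows with the original, non-dereplicated reads would produce a fasta name that is absent from the first column.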

Using the wrong files is always a possibility. I will have to do this at home.

I’ll restart the analysis from the beginning, and post the analysis logfile to you. But this won’t be done until later this afternoon/evening.

Thanks,

James

westcott,

Thanks for your assistance. Somehow I must have incorporated a wrong names file, as you pointed out, sometime in the process. I am no longer having any issues with analysis on the Win_64 platform. Not sure exactly when or how I mucked things up, but things are working fine now.

My apologies for making you spend time on this. Thanks again for your assistance.

James

No worries, glad you got it working. :)