Manipulate Sequence Identifiers

Dear mothur team,

with some data sets, it is necessary for me to manipulate sequence identifiers in the FASTA files (e.g., change sequence ID “HY5WXKH02C2LQY” to “group1HY5WXKH02C2LQY”). This is particularly important when combining data from different sequencing runs: As the sequencer “recycles” IDs, they are not unique anymore in a data set combined from different runs unless you rename them.

I would thus find it very convenient if mothur provided a command for sequence ID manipulation, such as prefixing the group name to it. I have already combed through the list of mothur’s commands, but could not find anything like it.

Thank you very much in advance for considering my request!


Kind regards, Sven

“they are not unique anymore in a data set combined from different runs unless you rename them”

Really? HY5WXKH is a time stamp, 02 is the location on the plate, and C2LQY is the x/y coordinate on the plate (see the “SFF Read names” thread on SEQanswers). Pretty sure that every sequence name is unique.
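
To make that layout concrete, here is a minimal Python sketch that splits a read name into the three fields described above. The function name and the field labels are made up for illustration, and the fixed widths (7, 2 and 5 characters) are assumed from the example ID in this thread rather than taken from a specification.

```python
def split_read_name(name):
    """Split a 14-character read name into its three fields.

    The widths (7 + 2 + 5) are an assumption based on the example
    ID "HY5WXKH02C2LQY"; other ID formats are rejected.
    """
    if len(name) != 14:
        raise ValueError("expected a 14-character read name: %r" % name)
    return {
        "timestamp": name[:7],   # e.g. HY5WXKH
        "region": name[7:9],     # e.g. 02, location on the plate
        "xy": name[9:],          # e.g. C2LQY, encoded x/y coordinate
    }


fields = split_read_name("HY5WXKH02C2LQY")
print(fields["timestamp"], fields["region"], fields["xy"])
```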

Pat

Hi Pat,

Thank you for your quick reply!

Within the thread you refer to, there is also the following citation:

This identifier is guaranteed to be unique only within the context of a single sequencing Run, and may or may not be unique across specific sets of Runs.

This is exactly my problem. I would like to combine sequences across sets of runs, and it frequently happens that I end up with duplicate sequence IDs when doing that. I solve the problem by prefixing some alphanumeric code to them that is specific for the respective run using a script of my own, but would find it nicer to have this functionality directly in mothur.
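
For anyone who needs the same workaround, the prefixing step can be sketched in a few lines of Python. This is not my actual script, only a minimal illustration: the function name is hypothetical, and it assumes plain FASTA input where every header line starts with ">" and the ID follows immediately.

```python
import io


def prefix_fasta_ids(lines, prefix):
    """Yield FASTA lines with `prefix` prepended to each sequence ID.

    Header lines (starting with ">") get the prefix inserted right
    after the ">"; sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


# In-memory stand-in for the FASTA file of one run; a real file
# object opened with open(...) could be passed in the same way.
run1 = io.StringIO(">HY5WXKH02C2LQY\nACGT\n")
for line in prefix_fasta_ids(run1, "group1"):
    print(line, end="")
```

Running each run's file through this with a different prefix before concatenating them keeps the IDs distinct in the combined data set. Any names or group files that refer to the old IDs would have to be renamed the same way.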

There are also other situations where custom sequence ID prefixes might be useful. For the use of down-stream applications other than mothur which cannot use mothur’s group file, group-specific sequence ID prefixes would be really great.


Best, Sven

Hello -

I think I may be having a similar problem. I’m trying to analyse a set of sequences that came from 2 separate sequencing runs, all with unique barcodes. We have created a groups file that runs through to the screen.seqs step successfully, but after that I get errors every time I try to add the groups file, which means that there is no indication of ‘sample’ at the end of the analysis. As a result, the OTU table I output has no samples/groups; everything is lumped together. Have you come up with a solution for this?

Thank you!
Kathy

Could you post the error you are getting after screen.seqs, and the commands you are running up to that point?

Thanks for the feature request. The rename.seqs command will be part of 1.32.0.