Keeping Illumina headers unchanged for non-Mothur use

After commands like trim.seqs, screen.seqs, make.contigs, etc., Mothur always reports that it changed the ':' in my sequence names "for ease in downstream analyses". The only problem is that other programs, e.g. Pandaseq (for pairing R1 and R2 Illumina reads), won't read sequences in the format Mothur changes them to. Is there an option to bypass the Illumina sequence header change that Mothur makes automatically? Thanks.

We could add that as a feature to make.contigs, but… Pandaseq is itself a program for making contigs, which is what make.contigs already does, so I'm not sure why you would need to do this. FWIW, we change the ':' to '_' for people who like to make trees.

I think it would be a worthwhile option/command to have in Mothur, as there are many programs and scripts that require a specific header format. I wanted to use Pandaseq after some sequence truncating/editing in Mothur to compare Mothur's read-pairing algorithm to Pandaseq's, as I know the two are built slightly differently.

In case you're interested why: I've been pairing reads from fungal ITS amplicons in Mothur and trying different levels of quality filtering/truncating of the individual reads prior to pairing. I'm getting worried because as I truncate the reads more and more (e.g. increasing qwindowaverage, playing with qwindowsize), Mothur still pairs all of my reads, just producing shorter contigs on average. Shouldn't I reach a point where Mothur can't pair the reads any more because the region of overlap has been trimmed away?

Thanks for your help.

Does the number of reads with N's go up as the contigs get shorter?

Thanks for the suggestion. I added a modifynames parameter to the set.dir command. In the next release you will be able to run the following:

set.dir(modifynames=F)

to override the changes to sequence names in mothur’s commands.

Interestingly, the number of Ns seems to go down as the reads get shorter. Looking only at the average paired-read length, the average decreases as quality trimming becomes more rigorous, but only slightly (e.g. 214bp, 212bp, 211.9bp, etc.). However, the shortest read in the dataset becomes smaller and smaller, e.g. 172bp, 75bp, 71bp, 31bp, and this sequence never contains any Ns. As a result of the lower proportion of N-containing sequences with each round of filtering, the total number of sequences retained after screen.seqs(maxambig=0) is larger each time.

For what it’s worth, I doubt that this will seriously affect the ecological conclusions of my study down the road, as I also filter my OTU table fairly stringently, subsample/rarefy, etc., but it is somewhat puzzling at this stage…

Thanks for your help.