Mismatched .groups file after using chop.seqs


I’m compiling two sets of data from different sequencing runs, which used different sequencing kits (Ion Torrent 100 bp vs 200 bp kit). Once I merged the data, the sequences from the 200 bp kit aligned better to each other, with a significant overhang that was not present in the sequences from the 100 bp kit. To deal with this, I used chop.seqs on my merged data to cut off the overhanging bases from the 200 bp kit. This left me with a nice-looking alignment, but there was no .groups option in the chop.seqs command, so the few sequences that were removed entirely from my .fasta file (everything shorter than 110 bp) were not removed from my .groups file. As a result, the two files contain different numbers of sequences and I can’t run pre.cluster on my data. I’ve tried using screen.seqs on my fasta and groups files to get the number of sequences to match, but restricting read length to 110 bp (via minlength=110) doesn’t seem to have any effect on my .groups file.

I was wondering if anyone knows how I could either create a new .groups file that matches my .fasta file (keeping in mind that there are approximately 35 different samples with different names in this dataset), or trim down my .groups file so that its sequences match those in my .fasta and .names files. That way I could run pre.cluster, then dist.seqs and the cluster analyses. I’ve already tried running make.group on my .fasta file, but I don’t know how to keep my group names consistent with the names found in my .names file.
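
For anyone who wants to trim the .groups file by hand, here is a minimal Python sketch of the idea, not a mothur feature. It assumes a tab- or space-delimited .groups file (sequence name, then group), a two-column .names file (representative name, then a comma-separated list of the reads it stands for), and FASTA headers whose first token is the sequence name. The function names and the file paths in `trim_groups_file` are hypothetical.

```python
def read_fasta_ids(fasta_lines):
    """Collect the sequence name (first token after '>') from each header."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def expand_with_names(fasta_ids, names_lines):
    """Add the duplicate reads represented by each sequence still in the FASTA.

    Assumes each .names line is: representative <whitespace> name1,name2,...
    """
    keep = set(fasta_ids)
    for line in names_lines:
        fields = line.split()
        if len(fields) == 2 and fields[0] in fasta_ids:
            keep.update(fields[1].split(","))
    return keep


def filter_groups(groups_lines, keep_ids):
    """Return the .groups lines whose sequence name is in keep_ids."""
    kept = []
    for line in groups_lines:
        fields = line.split()
        if fields and fields[0] in keep_ids:
            kept.append(line.rstrip("\n"))
    return kept


def trim_groups_file(fasta_path, names_path, groups_path, out_path):
    """File-based wrapper; all four paths are supplied by the caller."""
    with open(fasta_path) as fh:
        ids = read_fasta_ids(fh)
    with open(names_path) as fh:
        keep = expand_with_names(ids, fh)
    with open(groups_path) as fh:
        kept = filter_groups(fh, keep)
    with open(out_path, "w") as out:
        out.write("\n".join(kept) + "\n")
```

The group labels are copied through untouched, so the roughly 35 sample names stay exactly as they were in the original .groups file. If no .names file is in play yet, passing `read_fasta_ids(...)` straight to `filter_groups` should be enough.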

Thanks, I have my workflow available if it will help.


Never mind, I resolved the issue. I had to run chop.seqs after screen.seqs rather than before it. With the short sequences already screened out, chop.seqs no longer removes anything from my .fasta file, so the number of sequences stays consistent between my .fasta and .groups files.
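
A quick way to confirm that the two files agree after reordering the steps is to compare the sequence names directly. This is only a sketch under the assumption that the data have not yet been collapsed with a .names file (otherwise the .groups file legitimately lists more names than the .fasta file); the function names are hypothetical.

```python
def fasta_names(fasta_lines):
    """Sequence names taken from the first token of each FASTA header."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def group_names(groups_lines):
    """Sequence names taken from the first column of a .groups file."""
    names = set()
    for line in groups_lines:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def compare(fasta_lines, groups_lines):
    """Return (names only in the FASTA, names only in the .groups file)."""
    in_fasta = fasta_names(fasta_lines)
    in_groups = group_names(groups_lines)
    return sorted(in_fasta - in_groups), sorted(in_groups - in_fasta)
```

Two empty lists mean the files are consistent; anything in the second list is a sequence that was dropped from the .fasta file but is still present in the .groups file.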