I’m trying to analyze fungal ITS2 sequences from Sanger sequencing and am following a protocol that @Kendra shared with me in this post I made earlier this year about getting started. I have hit a snag that doesn’t make sense.
My approach thus far has been to pretty much mimic what was in the protocol mentioned above:
summary.seqs(fasta=seqs.txt)
unique.seqs(fasta=seqs.txt)
count.seqs(name=seqs.names, group=group.txt)
pre.cluster(fasta=current, diffs=2, count=current)
summary.seqs(fasta=current, count=current)
chimera.vsearch(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current, count=current)
summary.seqs(fasta=current, count=current)
When I get to the final summary.seqs(fasta=current, count=current) command at the end of this code block, I get two errors:
[ERROR]: 'LCl_84' is not in your name or count file, please correct.
[ERROR]: 'HCO3_120' is not in your name or count file, please correct.
[ERROR]: Your count file contains 94 unique sequences, but your fasta file contains 76. File mismatch detected, quitting command.
The interesting part is that when I rerun that command, the error reports a different number of sequences in my FASTA file each time. The first error message said that my count file had 94 unique sequences but my FASTA file had 76, and the command quit because of the mismatch. When I run the command again, I get this error:
[ERROR]: 'HCO3_120' is not in your name or count file, please correct.
[ERROR]: 'LCl_84' is not in your name or count file, please correct.
[ERROR]: Your count file contains 94 unique sequences, but your fasta file contains 53. File mismatch detected, quitting command.
And if I run the same summary.seqs() command once more, I get yet another FASTA sequence count in the error:
[ERROR]: 'HCO3_120' is not in your name or count file, please correct.
[ERROR]: 'LCl_84' is not in your name or count file, please correct.
[ERROR]: Your count file contains 94 unique sequences, but your fasta file contains 62. File mismatch detected, quitting command.
Why are the sample names and the number of unique sequences reported for the .count_table file consistent across runs, while the number of sequences reported for the FASTA file varies so much?
Also, when I manually inspect the FASTA file I count 97 unique sequences.
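To make that manual check repeatable, a minimal Python sketch like the one below could compare the names in the two files. It is not part of the protocol. It assumes the count table is tab-delimited with the sequence name in the first column and a header row starting with Representative_Sequence, and it skips lines starting with # as a guess about other count-table layouts. The function names and the file paths passed on the command line are hypothetical.

```python
import sys
from collections import Counter


def fasta_names(path):
    """Return sequence names from FASTA headers, in file order.

    The name is taken as the text after '>' up to the first whitespace.
    """
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def count_table_names(path):
    """Return first-column names from a tab-delimited count table.

    Assumption: the header row starts with 'Representative_Sequence';
    blank lines and lines starting with '#' are skipped.
    """
    names = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            first = line.split("\t")[0]
            if first == "Representative_Sequence":
                continue
            names.append(first)
    return names


def compare_names(fasta_path, count_path):
    """Summarize how the names in the two files differ."""
    fasta = fasta_names(fasta_path)
    counts = count_table_names(count_path)
    fasta_set, count_set = set(fasta), set(counts)
    tally = Counter(fasta)
    return {
        "fasta_records": len(fasta),
        "fasta_unique_names": len(fasta_set),
        "count_rows": len(counts),
        "only_in_fasta": sorted(fasta_set - count_set),
        "only_in_count": sorted(count_set - fasta_set),
        "duplicate_fasta_names": sorted(n for n, k in tally.items() if k > 1),
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical paths): python compare_names.py my.fasta my.count_table
    for key, value in compare_names(sys.argv[1], sys.argv[2]).items():
        print(f"{key}: {value}")
```

If names such as LCl_84 or HCO3_120 show up under only_in_fasta, that would at least confirm that the FASTA and count files went out of sync somewhere between pre.cluster and remove.seqs, though it would not explain why the reported FASTA count changes between runs.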
I’m not really sure what is going on under the hood of the de novo clustering step or the chimera-detection step to cause this error, or why the reported number of FASTA sequences changes each time I run the command. Has anyone run into this issue before?