Just before mothur v1.44 was released, our supercomputing facility updated to mothur v1.43, so that is the version I am using. To test this version, I ran a subset of my larger dataset through the mothur SOP. We worked out the kinks, and I was able to run the subset through several SOP iterations. When I try to run the full dataset, however, I consistently get the same error from summary.seqs after alignment:
[WARNING]: We found more than 25% of the bases in sequence EQcs|#*3;?@_8BMSPGEKOPLC7)yjV="a?ybWO@y7=G)Vw3Q^[u to be ambiguous. Mothur is not setup to process protein sequences.
[ERROR]: ‘EQcs|#*3;?@_8BMSPGEKOPLC7)yjV="a?ybWO@y7=G)Vw3Q^[u’ is not in your name or count file, please correct.
[ERROR]: Your count file contains 10723136 unique sequences, but your fasta file contains 136970. File mismatch detected, quitting command.
I decided to:
- See which line the sequence EQcs|#*3;?@_8BMSPGEKOPLC7)yjV="a?ybWO@y7=G)Vw3Q^[u was in by using
grep -rn "EQcs|#*3;?@_8BMSPGEKOPLC7)yjV" NPRB20.trim.contigs.good.unique.align
The string was not found in the .align file.
- Check how many sequences were in the following files using
grep ">" filename | wc -l
The results were:
NPRB20.trim.contigs.good.fasta = 14622735
NPRB20.trim.contigs.good.unique.fasta = 10723136
NPRB20.trim.contigs.good.count_table = 14622735; unique = 10723136
NPRB20.trim.contigs.good.unique.align = from mothur.logfile 10723136; from grep 10723168
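Two caveats apply to the grep checks above. First, without -F, grep treats the search string as a regular expression, so characters such as * are not matched literally and a literal occurrence of the name can be missed. Second, grep ">" counts every line that contains ">" anywhere, not only header lines, so a stray ">" inside a non-header line would inflate the count. Below is a minimal shell sketch of stricter versions of both checks. The function names are hypothetical helpers, not mothur or grep features, and -a is included on the assumption that the file may contain bytes that make grep treat it as binary.

```shell
# find_literal, count_headers and stray_gt are illustrative names only.

find_literal() {
    # -F: fixed-string match, so | * ? ) are taken literally
    # -a: treat the file as text even if it contains odd bytes
    # -n: print the line number of each match
    grep -aFn -- "$1" "$2"
}

count_headers() {
    # Count only lines that START with ">", i.e. actual FASTA headers
    grep -ac '^>' "$1"
}

stray_gt() {
    # List lines that do not start with ">" but contain ">" later on;
    # these are counted by grep ">" even though they are not headers
    grep -an '^[^>].*>' "$1"
}

# Example calls with the file name from this post:
# find_literal 'EQcs|#*3;?@_8BMSPGEKOPLC7)yjV' NPRB20.trim.contigs.good.unique.align
# count_headers NPRB20.trim.contigs.good.unique.align
# stray_gt NPRB20.trim.contigs.good.unique.align
```

If count_headers returns 10723136 while stray_gt lists 32 lines, the extra "sequences" would be non-header lines containing ">" rather than real records.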
Note that the number of unique sequences reported in the align.seqs logfile matches the count table and unique.fasta, yet the count from the grep command on the .align file is different: it finds 32 more sequences.
The commands I ran for the alignment and summary are:
align.seqs(fasta=NPRB20.trim.contigs.good.unique.fasta, reference=silva.nr_v138.pcr.align, flip=T, processors=8)
summary.seqs(fasta=NPRB20.trim.contigs.good.unique.align, count=NPRB20.trim.contigs.good.count_table, processors=8)
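To see which names summary.seqs would reject, the header names in the .align file can be compared against the first column of the count table. This is a minimal sketch, assuming the count table has a single header row followed by tab-separated rows with the sequence name in column 1; if your count table is laid out differently, the tail/cut step needs adjusting. names_only_in_fasta is a hypothetical helper name.

```shell
# names_only_in_fasta is an illustrative helper, not a mothur command.
# Usage: names_only_in_fasta <fasta_or_align_file> <count_table>
names_only_in_fasta() {
    tmpdir=$(mktemp -d)

    # Header lines with the leading ">" removed; keep the text before any tab
    grep -a '^>' "$1" | LC_ALL=C sed 's/^>//' | cut -f1 \
        | LC_ALL=C sort -u > "$tmpdir/fasta.names"

    # First column of the count table, skipping the assumed single header row
    tail -n +2 "$2" | cut -f1 | LC_ALL=C sort -u > "$tmpdir/count.names"

    # Names present in the fasta/align file but absent from the count table
    LC_ALL=C comm -23 "$tmpdir/fasta.names" "$tmpdir/count.names"

    rm -rf "$tmpdir"
}

# Example call with the file names from this post:
# names_only_in_fasta NPRB20.trim.contigs.good.unique.align NPRB20.trim.contigs.good.count_table
```

Any names printed are headers in the alignment that have no matching row in the count table, which is the condition the second [ERROR] line describes.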
Any help would be much appreciated, as these data are the only thing I can work on during quarantine :).