mothur

Error in summary.seqs after alignment

Our supercomputing facility updated to mothur v1.43 just before v1.44 was released, so that is the version I am using. To test it, I ran a subset of my larger dataset through the mothur SOP. We worked out the kinks and I was able to run the subset through several SOP iterations. When I try to run the larger dataset, though, I keep getting the same error from summary.seqs after alignment.

[WARNING]: We found more than 25% of the bases in sequence EQcs|#*3;?@_8BMSPGEKOPLC7)yjV="a?ybWO@y7=G)Vw3Q^[u to be ambiguous. Mothur is not setup to process protein sequences.

[ERROR]: ‘EQcs|#*3;?@_8BMSPGEKOPLC7)yjV="a?ybWO@y7=G)Vw3Q^[u’ is not in your name or count file, please correct.

[ERROR]: Your count file contains 10723136 unique sequences, but your fasta file contains 136970. File mismatch detected, quitting command.

I decided to:

  1. See which line the sequence EQcs|#*3;?@_8BMSPGEKOPLC7)yjV="a?ybWO@y7=G)Vw3Q^[u was in by using

grep -rn "EQcs|#*3;?@_8BMSPGEKOPLC7)yjV" NPRB20.trim.contigs.good.unique.align

The string was not found in the .align file.
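A caveat on this search: grep reads the pattern as a regular expression by default, so the `#*` in the name means "zero or more `#`" rather than a literal `#*`, and the search can come up empty even if the name is in the file (`grep -F` searches for a fixed string instead). Below is a minimal Python sketch of the same literal search. `find_literal` is a hypothetical helper name of my own, not part of mothur.

```python
def find_literal(path, needle):
    """Return the 1-based numbers of lines that contain `needle` literally.

    The file is read as bytes so that stray binary data cannot raise a
    decoding error, and no character in `needle` is treated as a wildcard.
    """
    target = needle.encode("utf-8")
    hits = []
    with open(path, "rb") as handle:
        for lineno, line in enumerate(handle, start=1):
            if target in line:
                hits.append(lineno)
    return hits
```

For the file above this would be called as `find_literal("NPRB20.trim.contigs.good.unique.align", 'EQcs|#*3;?@_8BMSPGEKOPLC7)yjV')`. An empty list means the name really is absent.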

  2. Check how many sequences were in the following files using
    grep ">" filename | wc -l

The results were:
NPRB20.trim.contigs.good.fasta = 14622735
NPRB20.trim.contigs.good.unique.fasta = 10723136
NPRB20.trim.contigs.good.count_table = 14622735; unique = 10723136
NPRB20.trim.contigs.good.unique.align = from mothur.logfile 10723136; from grep 10723168
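One thing to keep in mind with these counts: `grep ">"` counts every line that contains a `>` anywhere, not only header lines, so stray `>` characters inside damaged lines would inflate the number. Whether that explains the 32 extra lines in the .align file is only a guess. A minimal sketch that separates the two cases (`count_fasta_lines` is a hypothetical helper name):

```python
def count_fasta_lines(path):
    """Return (header_lines, stray_lines) for a fasta-style file.

    header_lines: lines that start with '>' (one per sequence).
    stray_lines:  lines that contain '>' somewhere after the first
                  character, which `grep ">"` would also count.
    """
    headers = 0
    stray = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                headers += 1
            elif b">" in line:
                stray += 1
    return headers, stray
```

If `stray` comes back non-zero, those lines are worth looking at directly, since a sequence line in an alignment should not contain `>`.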

Note that the number of unique seqs reported in the align.seqs logfile matches the count table and unique.fasta, yet the grep count for the .align file is 32 higher. The commands I ran for the alignment and summary are:

align.seqs(fasta=NPRB20.trim.contigs.good.unique.fasta, reference=silva.nr_v138.pcr.align, flip=T, processors=8)

summary.seqs(fasta=NPRB20.trim.contigs.good.unique.align, count=NPRB20.trim.contigs.good.count_table, processors=8)

Any help on this would be much appreciated as these resulting data are the only thing I can work on during quarantine :).

I suspect something happened mid-stream when running an earlier command and a file got messed up. Can you try going back a step or two and see if you still get the error?

Thanks,
Pat

I re-ran the steps leading up to alignment and it looks as though the alignment and subsequent summary worked as normal. Then when I ran the subsequent screen.seqs and summary commands I got the same errors, but with a different name for the “protein sequence”.

[WARNING]: We found more than 25% of the bases in sequence^@^D8^P^C’^B^O^B^@n to be ambiguous. Mothur is not setup to process protein sequences.

[ERROR]: ^@^D8^P^C’^B^O^B^@n’ is not in your name or count file, please correct.

[ERROR]: Your count file contains 10686830 unique sequences, but your fasta file contains 2626198. File mismatch detected, quitting command.
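The `^@`, `^D` and `^P` in that name are caret notation for control bytes (`^@` is a NUL byte), which suggests raw binary data sitting in the file rather than a real sequence name. A minimal sketch for locating the first such byte, so the damaged region can be inspected; `first_control_byte` is a hypothetical helper name and the choice of which bytes count as suspicious is my own assumption:

```python
import re

# Control bytes other than tab, newline and carriage return, plus DEL.
# Treating these as signs of damage is an assumption, not a mothur rule.
_SUSPICIOUS = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def first_control_byte(path, chunk_size=1 << 20):
    """Return the byte offset of the first suspicious byte, or None."""
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return None  # reached end of file without a hit
            match = _SUSPICIOUS.search(chunk)
            if match:
                return offset + match.start()
            offset += len(chunk)
```

A returned offset near the end of the file would point toward a write that was cut short, while `None` would mean the garbled name is coming from somewhere else.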

I have a couple theories as to what might be happening, but I’m not sure if they are valid.

  1. With past versions I have had trouble running large datasets with multiple processors, particularly on heftier commands like align.seqs and dist.seqs. I ran align.seqs with 8 processors when I got the first warning, and screen.seqs with 28 processors (because I forgot to set the number of processors in the batch command) when I got the second.

  2. I tried something new with screen.seqs using the optimize and criteria options just to see how similar it would be to my own choices for things like maxlength, start, and end. So, maybe it has something to do with that.

I’ll test those theories and let them run overnight and hopefully will have an answer. If it isn’t either of those things I’m pretty stumped as to what to try next, any thoughts?

Sorry for inundating you with info, but I have results from my test runs.

The alignment step doesn’t seem to work when more than one processor is used: the subsequent summary.seqs throws errors. Running screen.seqs with only one processor didn’t seem to make a difference, as the subsequent summary still failed with errors.

When I ran screen.seqs after alignment with start and end in lieu of the optimize and criteria options I got a completely different error when I tried to run the subsequent summary.

[ERROR]: ‘M03580_119_000000000-J35C5_1_2114_22308_11313’ is not in your name or count file, please correct.

[ERROR]: ‘M03580_119_000000000-J35C5_1_1118_25052_7719’ is not in your name or count file, please correct.

[ERROR]: Your count file contains 14710 unique sequences, but your fasta file contains 2. File mismatch detected, quitting command.
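To see exactly which names disagree between the fasta and the count file, something like the following could help. It is a minimal sketch under assumptions: that the count table is tab-separated with the sequence name in the first column and a header row starting with `Representative_Sequence`, and that any line starting with `#` can be skipped. The helper names are my own, not mothur functions. It keeps every name in memory, so it will use a fair amount of RAM on a dataset with millions of names.

```python
def fasta_names(path):
    """Collect sequence names: the first whitespace-delimited token after '>'."""
    names = set()
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def count_table_names(path):
    """Collect names from the first tab-separated column of a count table.

    Assumed layout: a header row whose first field is
    'Representative_Sequence'; blank lines and '#' lines are skipped.
    """
    names = set()
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            name = line.split("\t", 1)[0].strip()
            if name != "Representative_Sequence":
                names.add(name)
    return names


def compare_names(fasta_path, count_path):
    """Return (names only in the fasta, names only in the count table)."""
    in_fasta = fasta_names(fasta_path)
    in_count = count_table_names(count_path)
    return in_fasta - in_count, in_count - in_fasta
```

If the first set is small and the second is huge, the fasta is the file that lost data, which would fit a count file with 14710 names against a fasta with 2.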

Just for reference my summary after the alignment with 1 processor and before screen.seqs looks like this:

             Start  End   NBases  Ambigs  Polymer  NumSeqs
Minimum:     0      0     0       0       1        1
2.5%-tile:   8      9582  252     0       3        365569
25%-tile:    8      9582  252     0       4        3655684
Median:      8      9582  252     0       4        7311368
75%-tile:    8      9582  252     0       4        10967052
97.5%-tile:  8      9582  253     0       6        14257167
Maximum:     9582   9582  299     0       18       14622735
Mean:        12     9577  252     0       4

unique seqs: 10723136
total # of seqs: 14622735

This is all so strange because I was able to run a subset of these data without a hitch.

I think I figured out my problem :raised_hands:. It turns out someone outside of our project had been using our workspace on the supercomputer and was taking up most of the storage. On my end it showed I was using only 300 of 1000 GB, but actual usage was more like 875 of 1000 GB. So I think the output files weren’t being written correctly because of the storage shortage. I am working with our supercomputing facility to set up a warning for when we get close to the usage limit. In the meantime, it has been running without a hitch and is currently at the cluster.seqs phase, yay!
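For anyone who hits the same thing, a quick free-space check before launching a long step is easy to script. This is a minimal sketch using Python's standard `shutil.disk_usage`. The helper name `has_headroom` and the threshold are hypothetical, and it reports what the filesystem sees, which may not reflect a per-project quota like ours, so the facility's own quota tools remain the authority.

```python
import shutil


def has_headroom(path, min_free_bytes):
    """Return True if the filesystem holding `path` reports enough free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_free_bytes
```

For example, `has_headroom(".", 200 * 1024**3)` asks whether at least 200 GiB is free in the current working directory. The figure is arbitrary and should be sized to the expected output of the step.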
