Hi there, I am quite new to Mothur and have kept coming up with the same issue in my dataset so am looking for some advice.
I have a lage dataset (62) of 16s rRNA samples from the V3-V4 region with the following primers:
341F Primer CCTACGGGNGGCWGCAG
806R Primer GGACTACNVGGGTWTCTAAT
The problem comes when I get to align.seqs where I get this warning:
It took 6298 secs to align 20287188 sequences. [WARNING]: 17744737 of your sequences generated alignments that eliminated too many bases, a list is provided in Astability.trim.contigs.good.unique.flip.accnos. [NOTE]: 10000155 of your sequences were reversed to produce a better alignment. It took 6299 seconds to align 20287188 sequences.
Removed 27204442 sequences from Astability.trim.contigs.good.count_table.
[WARNING]: Astability.trim.contigs.good.count_table contains only sequences from the .accnos file.
Output File Names:
Astability.trim.contigs.good.pick.count_table
/******************************************/
Output File Names:
Astability.trim.contigs.good.unique.good.align
Astability.trim.contigs.good.unique.bad.accnos
Astability.trim.contigs.good.good.count_table
It took 13904 secs to screen 20287188 sequences.
I have tried a number of things to get this to stop happening. I’ve double checked my silva reference and coordinates, I’ve also run Trimmomatic on the raw fastq sequences and then used these to follow the pipeline to align.seqs again. I will try using the files prior to align seqs to continue the pipeline, but would really like to know what I’ve done incorrectly.
What are you using to align your sequences? The quality of the alignment or the sequences looks pretty bad. Not many of the sequences are the right length. More than half are 14 or so nucleotides long.
Can you clarify what reference file you are using and what the commands and specific syntax you are running upstream of screen.seqs?
Also, I generally discourage people from sequencing with the 2x300 chemistry as well as the v3-V4 region.
I downloaded an ecoli fasta from NCBI and trimmed this to my primers and saved it as FINALecoli.fasta. I then aligned this to the silva seed reference:
Apologies if I have left out any information, I am still new to using mothur. But please do let me know if you have any other questions, I appreciate the help!
Again this resulted in a similar warning of "17612804 of your sequences generated alignments that’s eliminated too many bases… **
****9841274 of your sequences were reversed…"
Here is the summary: Mothur > summary.seqs(fasta=Astability.trim.contigs.good.unique.align, count=Astability.trim.contigs.good.count_table)
So the screening removed all of the sequences. Would there be much point to me redoing the screen.seqs with a wider interval? It has been taking about 5 hours to run each command so it’s difficult to tinker about. Any advice is welcomed.
Following from screen.seqs(fasta=Astability.trim.contigs.good.unique.align,count=Astability.trim.contigs.good.count_table, start=1000, end=40000)
It took 10411 secs to screen 20287188 sequences, removed 20287188.
Output File Names:
Astability.trim.contigs.good.unique.good.align -EMPTY
Astability.trim.contigs.good.unique.bad.accnos
Astability.trim.contigs.good.good.count_table -doesn’t exist
I then tried to run: Mothur > filter.seqs(fasta=Astability.trim.contigs.good.unique.align, vertical=T, trump=.)
(Using the output file from align.seqs since the empty one from screen.seqs couldn’t be used)
So you are sequencing the v3-v4 region, of approx 460 pb. As Pat said, you may have a problem in the alignment (with the make.contigs command)… In summary, you are probably sequencing a chunk of 460 pb with a pair end approach that sequence 250 on each end. It’s cutting it very close, and… well, Pat explains it very well on his post, so please check Why do I have such a large distance matrix
If you review your summary table, you have 680k sequences (of your 27M) that are above 400pb (I’m guessing you have even less with >460pb…) Anyway, all of them start at around 43000 and ends in 43116, so if you delete all outside the range 1000 and 40000 you are basically deleting everything that may be on the correct size.
I think you’re seeing exactly what we saw in Table 2 of Kozich with this region. To start, you have very little overlap between the two reads (~75 nt) and if I had to guess you had a bad sequence run. After assembling the reads with make.contigs you then probably removed a lot of sequences that had ambiguous bases, right?
Looking at your most recent output from summary.seqs, more than 25% of your sequences end at the beginning 5’ end of the gene (see end values <= 1096; the bases start at 1044) and more than 25% of your sequences start at the end of the 3’ end of the gene (see start values >=43061). I suspect there’s more than 50% of your reads that fall in this area.
You found that the V3-V4 region started at 6388 and ended at 25318. Those values seem right. Unfortunately, none of those values show up in the output from summary.seqs - that doesn’t mean they aren’t there, but there aren’t many of them, which supports what you’re seeing. When you run screen.seqs with the full alignment database you should be using start=6388 and end=25318. The start value is the point in the alignment where sequences start at or before and end is the point in the alignment end at or after.
I wonder if you actually sequenced what you think you did. I’m curious what happens when you take one of your unaligned sequences and run it through blast. Does it come back as a 16S? I have a suspicion that it might be PhiX (or someone else’s project).