align.seqs and screen.seqs give odd result in defined control sequence set

Hi,

I am new to mothur, however I have been working with the MySeq.SOP with a 16S V4 amplicon sequencing set from a defined set of DNA samples from pure cultures. Essentially 12 genomic DNAs of equal mass from 12 different bacterial and archaeal cultures were mixed and subjected to amplification using universal V4 primers. Using the SOP, I successfully created a FASTA file with about 27,000 sequences. I used the the latest Silva reference dataset (v128) and ran the align-seqs in mothur and successfully aligned all 27,000 sequences. When I ran screen.seqs with the designated start and end points from the summary.seqs output, the number of unique sequences reduced to 33. When I used filter.seqs the Fasta file generated also contained 33 sequences of the correct length. OK at this point, I thought this is great. My control amplicon sequencing run shows relatively few sequences. Unfortunately, when I batch screened this Fasta file against the RDP and Silva website tools, only 8 of the expected 12 16S sequences were present. I thought at that time that the other sequences may not have amplified, however when I went to the original Fasta file created with 27,000 sequences and subsampled about 1000 sequences in four batches for screening against the RDP and Silva databases, three of the missing sequences were present in the original Fasta file which was used for the align.seqs command in mothur. Although these 3 missing sequences from the screen.seqs output have relatively low abundance relative to other sequences (another issue not related to this processing), they were definitely present in the original Fasta file with the correct amplicon length. What happened to eliminate these sequences from the final output? I am glad I ran this control, however it concerns me how this will impact the analyses of by experimental samples. I have the documentation of this run if that would help pasted below.



mothur > make.file(inputdir=FldSeq, type=fastq, prefix=stability) Setting input directory to: FldSeq/

Output File Names:
FldSeq/stability.files


mothur > make.contigs(file=stability.files, processors=4)

Using 4 processors.

Processing file pair FldSeq/V4_515F_New_V4_806R_New-control_1_ATGGAGCACT_R1.fastq - FldSeq/V4_515F_New_V4_806R_New-control_1_ATGGAGCACT_R2.fastq (files 1 of 1) <<<<<
Making contigs…

Done.

It took 17 secs to assemble 47321 reads.

It took 17 secs to process 47321 sequences.

Group count:
V4 47321

Total of all groups is 47321

Output File Names:
FldSeq/stability.trim.contigs.fasta
FldSeq/stability.trim.contigs.qual
FldSeq/stability.contigs.report
FldSeq/stability.scrap.contigs.fasta
FldSeq/stability.scrap.contigs.qual
FldSeq/stability.contigs.groups

[WARNING]: your sequence names contained ‘:’. I changed them to ‘_’ to avoid problems in your downstream analysis.

mothur > summary.seqs(fasta=stability.trim.contigs.fasta)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 252 252 0 3 1
2.5%-tile: 1 290 290 0 4 1184
25%-tile: 1 292 292 0 5 11831
Median: 1 292 292 0 5 23661
75%-tile: 1 292 292 0 6 35491
97.5%-tile: 1 292 292 2 6 46138
Maximum: 1 500 500 204 135 47321
Mean: 1 292.499 292.499 0.379113 5.28605

of Seqs: 47321

Output File Names:
FldSeq/stability.trim.contigs.summary

It took 1 secs to summarize 47321 sequences.

mothur > screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contis.groups, maxambig=0, maxlength=300)

Using 4 processors.
Processing sequence: 100
Processing sequence: 11834

Output File Names:
FldSeq/stability.trim.contigs.good.fasta
FldSeq/stability.trim.contigs.bad.accnos
FldSeq/stability.contigs.good.groups


It took 1 secs to screen 47321 sequences.
mothur > get.current()

Current RAM usage: 0.0256004 Gigabytes. Total Ram: 24 Gigabytes.

Current files saved by mothur:
fasta=FldSeq/stability.trim.contigs.good.fasta
group=FldSeq/stability.contigs.good.groups
qfile=FldSeq/stability.trim.contigs.qual
contigsreport=FldSeq/stability.contigs.report
processors=4
summary=FldSeq/stability.trim.contigs.summary
file=FldSeq/stability.files

Current input directory saved by mothur: FldSeq/

Current default directory saved by mothur: /Users/Rob/FldSeq/mothur/

Current working directory: /Users/Rob/

Output File Names:
current_files.summary


mothur > summary.seqs() Using FldSeq/stability.trim.contigs.good.fasta as input file for the fasta parameter.

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 252 252 0 3 1
2.5%-tile: 1 291 291 0 4 1093
25%-tile: 1 292 292 0 5 10926
Median: 1 292 292 0 5 21851
75%-tile: 1 292 292 0 5 32776
97.5%-tile: 1 292 292 0 6 42608
Maximum: 1 300 300 0 9 43700
Mean: 1 291.911 291.911 0 5.14522

of Seqs: 43700

Output File Names:
FldSeq/stability.trim.contigs.good.summary

It took 0 secs to summarize 43700 sequences.

mothur > unique.seqs(fasta=stability.trim.contigs.good.fasta)
1000 867
43700 27495

Output File Names:
FldSeq/stability.trim.contigs.good.names
FldSeq/stability.trim.contigs.good.unique.fasta


mothur > count.seqs(name=stability.trim.contigs.good.names, group=stability.cotigs.good.groups)

Using 4 processors.
It took 0 secs to create a table for 43700 sequences.


Total number of sequences: 43700

Output File Names:
FldSeq/stability.trim.contigs.good.count_table


mothur > summary.seqs(count=stability.trim.contigs.good.count_table) Using FldSeq/stability.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 252 252 0 3 1
2.5%-tile: 1 291 291 0 4 1093
25%-tile: 1 292 292 0 5 10926
Median: 1 292 292 0 5 21851
75%-tile: 1 292 292 0 5 32776
97.5%-tile: 1 292 292 0 6 42608
Maximum: 1 300 300 0 9 43700
Mean: 1 291.911 291.911 0 5.14522

of unique seqs: 27495

total # of seqs: 43700

Output File Names:
FldSeq/stability.trim.contigs.good.unique.summary

It took 0 secs to summarize 43700 sequences.

mothur > pcr.seqs(fasta=silva.nr_v128.align, start=11894, end=25319, keepdots=, processors=4)

Using 4 processors.
Processing sequence: 100
Processing sequence: 47659

Output File Names:
FldSeq/silva.nr_v128.pcr.align


It took 127 secs to screen 190661 sequences.

mothur > rename.file(input=silva.nr_v128.pcr.align, new=FldSeq/silva.v4.align)

Current files saved by mothur:
fasta=FldSeq/silva.nr_v128.pcr.align
group=FldSeq/stability.contigs.good.groups
name=FldSeq/stability.trim.contigs.good.names
qfile=FldSeq/stability.trim.contigs.qual
contigsreport=FldSeq/stability.contigs.report
count=FldSeq/stability.trim.contigs.good.count_table
processors=4
summary=FldSeq/stability.trim.contigs.good.unique.summary
file=FldSeq/stability.files

mothur > summary.seqs(fasta=silva.v4.align)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 9876 219 0 3 1
2.5%-tile: 1 13425 292 0 4 4767
25%-tile: 1 13425 293 0 4 47666
Median: 1 13425 293 0 5 95331
75%-tile: 1 13425 293 0 5 142996
97.5%-tile: 1 13425 459 1 6 185895
Maximum: 4225 13425 1521 5 16 190661
Mean: 1.27395 13424.8 309.097 0.048164 4.75374

of Seqs: 190661

Output File Names:
FldSeq/silva.v4.summary

It took 19 secs to summarize 190661 sequences.

mothur > align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.align)

Using 4 processors.

Reading in the FldSeq/silva.v4.align template sequences… DONE.
It took 66 to read 190661 sequences.
Aligning sequences from FldSeq/stability.trim.contigs.good.unique.fasta …
100
6874
It took 119 secs to align 27495 sequences.


Output File Names: FldSeq/stability.trim.contigs.good.unique.align FldSeq/stability.trim.contigs.good.unique.align.report
mothur > summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stbility.trim.contigs.good.count_table)

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 10644 250 0 3 1
2.5%-tile: 1 13424 291 0 4 1093
25%-tile: 1 13424 292 0 5 10926
Median: 1 13424 292 0 5 21851
75%-tile: 1 13424 292 0 5 32776
97.5%-tile: 1 13424 292 0 6 42608
Maximum: 3749 13425 300 0 9 43700
Mean: 10.9908 13423.7 291.91 0 5.14522

of unique seqs: 27495

total # of seqs: 43700

Output File Names:
FldSeq/stability.trim.contigs.good.unique.summary

It took 3 secs to summarize 43700 sequences.

mothur > screen.seqs(fasta=stability.trim.contigs.good.unique.align, count=staility.trim.contigs.good.count_table, summary=stability.trim.contigs.good.unique.summary, start=1, end=13425, maxhomop=8)

Using 4 processors.
Processing sequence: 100
Processing sequence: 6873

Output File Names:
FldSeq/stability.trim.contigs.good.unique.good.summary
FldSeq/stability.trim.contigs.good.unique.good.align
FldSeq/stability.trim.contigs.good.unique.bad.accnos
FldSeq/stability.trim.contigs.good.good.count_table


It took 3 secs to screen 27495 sequences.

mothur > summary.seqs(fasta=current, count=current)
Using FldSeq/stability.trim.contigs.good.good.count_table as input file for the count parameter.
Using FldSeq/stability.trim.contigs.good.unique.good.align as input file for the fasta parameter.

Using 4 processors.

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 13425 291 0 4 1
2.5%-tile: 1 13425 291 0 4 1
25%-tile: 1 13425 293 0 5 9
Median: 1 13425 293 0 5 17
75%-tile: 1 13425 293 0 6 25
97.5%-tile: 1 13425 294 0 8 33
Maximum: 1 13425 294 0 8 33
Mean: 1 13425 292.879 0 5.27273

of unique seqs: 33

total # of seqs: 33

Output File Names:
FldSeq/stability.trim.contigs.good.unique.good.summary

It took 0 secs to summarize 33 sequences.

mothur > filter.seqs(fasta=stability.trim.contigs.good.unique.good.align, vertcal=T, trump=.)

Using 4 processors.
Creating Filter…
8
8
8
9


Running Filter... 8 8 8 9

Length of filtered alignment: 297 Number of columns removed: 13128 Length of the original alignment: 13425 Number of sequences used to construct filter: 33

Output File Names:
FldSeq/stability.filter
FldSeq/stability.trim.contigs.good.unique.good.filter.fasta

Hello,

So if your sequences are still present and accounted for before the align.seqs step, can you track where they get dropped? This will tell you why they’re being dropped.

Are these sequences different enough from the others that they are not absorbed into another OTU during clustering (or even pre-clustering) ? Are they being thrown away as being chimeric? Are they not being classified properly and therefore being discarded? etc…

Cheers
Richard

Is it possible that the sequences are identical within the V4 region?