screen.seqs removes most sequences

Hi,

I am currently running through the MiSeqSOP. I have made a custom 16S v4 reference file from the silva.bacteria.fasta available from the mothur website then aligned my sequences:

_#Download a complete 16S bacterial sequence from NCBI as fasta format and add the primers used for amplification of the 16S gene for the NGS data generation to the 16S sequence fasta file

16S_serratia

#Align the primers on the 16S sequence using Seaview with the Clustal algorithm, then cut the region of 16S sequence targeted by your primers, and save it as a fasta file
e.g. in this case the v4 region: 16Sv4_serratia
#In Mothur
#Align the targeted 16S region sequence to the Mothur reference file (silva.bacteria.fasta)_

align.seqs(fasta=/home/chris/Desktop/16Sv4_serratia, reference=/media/chris/Storage/Genomes_Transcriptomes_Proteomes/16S_microbial_sequences/Silva/Mothur/silva.bacteria/silva.bacteria.fasta, processors=8)

#Summarise the alignment

summary.seqs(fasta=/home/chris/Desktop/16Sv4_serratiaalign, processors=8)

[/i]
Start End NBases Ambigs Polymer NumSeqs
Minimum: 219 0 4 1
2.5%-tile: 6119 13861 219 0 4 1
25%-tile: 6119 13861 219 0 4 1
Median: 6119 13861 219 0 4 1
75%-tile: 6119 13861 219 0 4 1
97.5%-tile: 6119 13861 219 0 4 1
Maximum: 6119 13861 219 0 4 1
Mean: 6119 13861 219 0 4

of Seqs: 1[/size]

Based on the information from the summary.seqs (above) I get the start and end positions where most of the reads will align to the reference, and can therefore use pcr.seqs as in the MiSeqSOP to make a region targeted reference alignment which will allow us to have smaller .align files.

pcr.seqs(fasta=/media/chris/Storage/Genomes_Transcriptomes_Proteomes/16S_microbial_sequences/Silva/Mothur/silva.bacteria/silva.bacteria.fasta, start=6119, end=13861, processors=8, keepdots=F)

#Rename the file SILVAv4.fasta[/i]

I then align my data to the custom alignment:

align.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/5_unique.seqs/merged16S.good.unique.fasta, reference=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_pcr.seqs/SILVAv4.fasta, processors=8)

[/i]

After the alignment I have 875936 sequences (see below) though some don´t align exactly at the same position:

summary.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/5_unique.seqs/merged16S.good.unique.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/6_count.seqs/merged16S.good.count_table, processors=8)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 21899
25%-tile: 7028 7742 8 0 2 218985
Median: 7028 7742 8 0 2 437969
75%-tile: 7028 7742 8 0 2 656953
97.5%-tile: 7736 7742 54 0 3 854038
Maximum: 7742 7742 119 0 18 875936
Mean: 6250.26 7088 10.6911 0 2.0833

of unique seqs: 392343

total # of seqs:[/i]

Following the SOP I then use screen.seqs to remove those badly aligning sequences, but I find myself with only 299016 sequences after…(see steps below):

screen.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.unique.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/6_count.seqs/merged16S.good.count_table, summary=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.unique.summary, start=7028, end=7742, maxhomop=8, processors=8)

#Check that all required undesired sequences were removed:

summary.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/9_screen.seqs/merged16S.good.unique.good.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/9_screen.seqs/merged16S.good.good.count_table, processors=8)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 2067 7742 8 0 1 1
2.5%-tile: 4118 7742 8 0 2 7476
25%-tile: 7028 7742 8 0 2 74755
Median: 7028 7742 8 0 2 149509
75%-tile: 7028 7742 8 0 2 224263
97.5%-tile: 7028 7742 55 0 3 291541
Maximum: 7028 7742 113 0 8 299016
Mean: 6692.58 7742 13.0448 0 2.05437

of Seqs: [/i]

I do not understand how/why screen.seqs …???

Having checked on the mothur forum, I saw that someone else was having a somewhat similar problem (http://mothur.ltcmp.net/t/screen-seqs-removed-47-of-my-sequences/1429/1), and that it was suggested that for this iteration of the screen.seqs command, he should swap the positions specified for the “start” and “end” arguments. If I try to do so I do indeed have much less sequences removed but I am not sure that I am doing the right thing.

summary.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.unique.good1.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/8_align.seqs/merged16S.good.good.count_table1, processors=8)

_Start End NBases Ambigs Polymer NumSeqs
Minimum: 2067 7740 1 0 1 1
2.5%-tile: 4118 7742 4 0 2 19916
25%-tile: 7028 7742 8 0 2 199157
Median: 7028 7742 8 0 2 398313
75%-tile: 7028 7742 8 0 2 597469
97.5%-tile: 7736 7742 54 0 2 776710
Maximum: 7742 7742 113 0 8 796625
Mean: 6846.2 7742 10.9162 0 2.01156

of unique seqs: 345671

total # of seqs: _

Can anyone help me get my head around this and so that I can understand how to correctly use the screen.seqs command, and help me understandwhat I am doing wrong?

Thank you,

Chris

Chris, I wonder whether your sequences are backwards. Your sequences lengths are very short after the alignment which typically happens after aligning backwards sequences.

pat

Hi Pat,

Thanks for your quick reply. I am not sure whether the reads are backwards or not. It would indeed be a possibility, since the screen.seqs seems to remove a more logical amount of sequences (leaving 796625 instead of 299016) if I run screen.seqs with start=7742, end=7028 instead of start=7028, end=7742. How could I determine their orientation? The data were supplied to me as demultiplexed files, so I can’t use the position of the adapters/primers to determine this unfortunately.

Is there an alternative explanation/solution if the sequences are forward?

If the sequences are backwards as you suggested, could this have implications for previous commands? For example shouldn’t the align.seqs command be run with flip=t? And shouldn’t this also be checked and taken into account right from the start, i.e. make.contigs, for which I have read in a post that you were planning to add a checkorient option to be included in the next release of mothur?

Also, earlier on in the SOP I have seen that the assembled sequences from make.contigs showed a whole range of sizes:

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 120 120 0 3 1
2.5%-tile: 1 191 191 0 4 26935
25%-tile: 1 270 270 0 4 269350
Median: 1 270 270 0 4 538700
75%-tile: 1 270 270 0 6 808050
97.5%-tile: 1 434 434 8 11 1050465
Maximum: 1 479 478 79 236 1077399
Mean: 1 271.99 271.99 0.703345 5.2692

of Seqs: 1077399

Because I knew that my MiSeq reads were 250 nucleotides long, and that the v4 primers I used (515F and 806R from the Earth Microbiome Project) amplify a 291bp fragment, I knew that the ~191-270bases long sequences were potentially correctly assembled as they were of a logical length, and that shorter/longer sequences were likely artifacts from the amplification, sequencing or alignment. I have therefore used the screen.seqs command to remove shorter/longer sequences :

screen.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/3_merge.files/merged16S.fasta, minlength=190, maxlength=310, maxambig=0, group=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/groups, summary=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/3_merge.files/merged16S.summary, processors=8)

Could this variety of sequence lengths observed also be improved if using ‘flip’ during make.contigs?

Thanks very much for your help!

Chris

Can you rerun align.seqs with flip=T and see what the summary.seqs output looks like? You don’t want end < start in screen.seqs.

pat

Here is the ouput of align.seqs then summary.seqs with flip=T

align.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/5_unique.seqs/merged16S.good.unique.fasta, reference=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/7_pcr.seqs/silva.bacteria.v4.fasta, flip=t, processors=8)
summary.seqs(fasta=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/5_unique.seqs/merged16S.good.unique.align, count=/home/chris/Bioinformatics/Data_Analysis/Guam_2013/MiSeq/Mothur/6_count.seqs/merged16S.good.count_table, processors=8)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: -1 -1 0 0 1 21899
25%-tile: 7028 7742 8 0 2 218985
Median: 7028 7742 8 0 2 437969
75%-tile: 7028 7742 8 0 2 656953
97.5%-tile: 7033 7742 54 0 4 854038
Maximum: 7742 7742 119 0 18 875936
Mean: 6018.45 6854.11 11.2892 0 2.11309

of unique seqs: 392343

total # of seqs: 875936

So what exactly are you sequencing - region? specimen? library generation method?

I’m sequencing the v4 region of the 16S rRNA (using the primers from the Earth Microbiome Project) from both RNA and DNA extracts from coral, sediments, water samples. Here is the detailed procedure:

Total RNA extract was reverse transcribed using the High-Capacity cDNA Reverse Transcription Kit (Life Technologies, Grand Island, NY) according to the manufacturer’s instructions, using primer 806R (GGACTACHVGGGTWTCTAAT) to prime the reaction. Subsequently, the generated complementary DNA (cDNA) was PCR amplified with primers 515F and 806R (e.g. Caporaso et al. 2012), targeting the V4 variable region of Bacterial and Archaeal small subunit (SSU) ribosomal RNA (rRNA) gene. The primers contained 5’ common sequence tags (known as common sequence 1 and 2, CS1 and CS2) as described previously (e.g. Moonsamy et al. 2013). The forward primer, CS1_515F (ACACTGACGACATGGTTCTACAGTGCCAGCMGCCGCGGTAA) and CS2_806R (TACGGTAGCAGAGACTTGGTCTGGACTACHVGGGTWTCTAAT) were synthesized by Integrated DNA Technologies (IDT; Coralville, Iowa) as standard oligonucleotides.

PCR amplifications were performed in 20 microliter reactions in 96-well plates. A mastermix for the entire plate was made using the 2X AccuPrime SuperMix II (Life Technologies, Gaithersburg, MD). Final concentration of CS1_515F and CS2_806R primers was 500 nM. From 10-100 ng of genomic DNA was added to each PCR reaction. Cycling conditions were as follows: 95°C for 5 minutes, followed by 28 cycles of 95°C for 30”, 55°C for 45” and 68°C for 30”. A final, 7 minute elongation step was performed at 68C. A positive control sample of a complex genomic DNA sample was added to each plate, in addition to a no-template control reaction. Each sample was analyzed using agarose gel electrophoresis employing the E-gel system (Life Technologies). Reactions were verified to contain visible amplification, in addition to no visible amplification in the no-template control prior to the second stage of PCR amplification.

A second PCR amplification was performed in 10 microliter reactions in 96-well plates. A mastermix for the entire plate was made using the 2X AccuPrime SuperMix II. Each well received a separate primer pair, obtained from the Access Array Barcode Library for Illumina Sequencers. The final concentration of each primer concentration was 400 nM, and each well received a separate primer set with a unique 10-base barcode (Fluidigm, South San Francisco, CA; Item# 100-4876 (Illumina)). Separate reactions with unique barcodes were included for positive control, no-template control (reaction 1) and a second no-template control reaction containing only Access Array Barcode library primers. Cycling conditions were as follows: 95°C for 5 minutes, followed by 28 cycles of 95°C for 30”, 60°C for 30” and 68°C for 30”. A final, 7 minute elongation step was performed at 68C. PCR yield of positive and negative controls and select samples were validated with Qubit fluorometric quantitation with the Qubit 2.0 fluorometer (Life Technologies) and with size and quantification employing an Agilent TapeStation2200 device with D1000 ScreenTape (Agilent Technologies, Santa Clara, California). After assessing no amplification in the negative controls, samples were pooled in equal volume and purified using solid phase reversible immobilization (SPRI) cleanup, implemented with AMPure XP beads at a ratio of 0.6X (v:v) SPRI solution to sample. This ratio removes DNA fragments shorted than 300 bp from the pooled libraries. Final quality control is performed using TapeStation2200 and Qubit analysis.

Does that help in understanding why screen.seqs filters out so many sequences?

Hmmm… I’m perplexed because even once your sequences are aligned - in either direction - they are very short. The V4 region is ~250 bp. I know this sounds dumb, but could you take one of your contigs and blast it at NCBI or classify it at RDP?

Hmmm, the output seems a bit weird. I’ve annotated a few randomly selected contigs from the unique.seqs output and here is what I got:

HWI-M02808_22_A7NR7_1_1101_13733_8334
TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTGACGACATGGTTCTACAGTGCCAGCAGCCGCGGTAAAACAAATGGTGTAAACCTTCTATAGGTTAATTAGAAACCCCTGTAGTCCAGACCAAGTCTCTGCTACCGTAACATAGTATCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAA
NCBI: Uncultured prokaryote gene for 16S rRNA, partial sequence, clone: NNCTHY7598
RDP: Coemansia reversa; NRRL1564; AFTOL-ID 140; AY546689
HWI-M02808_22_A7NR7_1_1101_13809_3933
ACCGAGATCTACACTGACGACATGGTTCTACAGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGTTGAGGATGACCGGTCCGCCCTCTGGGTGTGCATCTGGCTCAGCCTTGACATCTTCCTGAAGAACGTATCTGCACTTGATTGTGTGGTGCGGTATTTGGGACATTTACCTTGAGGAAATTAGAAACCCGTGTAGTCCAGACCAAGTCTCTGCTACCGTAATGTATAGTC
NCBI: Symbiodinium sp. clade D 18S ribosomal RNA gene, partial sequence
RDP: Rhizoclosmatium globosum; JEL06; AY988506
HWI-M02808_22_A7NR7_1_1101_12160_4171
CGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGGATGGCCCGGCCGGTCTGCCGCGAGGTACGTTACTGGCCGAGCTGTTCTTCCTCGCCGAGACCGAGTGTGCTCTTAACTGAGTGTGCTTGGGATCTGCGACGTTTACTTTGAAAAAATTAGAGTGTTCAAAGCAAGCCGGCGCCTGAATACGTAAGCATGGAATAATGGAATAGGATTTTGGTTCTATTTTGTTGGTTTCTGGAACCGAAGTAATGATTGAGAGAGACAGTTGGGGGCATTCGTATTTCGCTGTCAGAGGTGAAATTCTTGGATTTGCGAAAGACGAACTACTGCGAAAGCATTTGCCAAGAATGTTTTCATTAATCAAGAACGAAAGTTAGAGGATCGAAGACGATTAGATAC
NCBI: Pocillopora meandrina 18S small subunit ribosomal RNA gene and internal transcribed spacer 1, partial sequence
RDP: Candida auringiensis; NRRL Y-17674; DQ438225
HWI-M02808_22_A7NR7_1_1101_17025_5465
GAGATCTACACTGACGACATGGTTCTACAGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGGATGGCCCGGCCGGTCTGCCGCGAGGTACGTTACTGGCCGAGCTGTTCTTCCTCGCCGAGACCGAGTGTGCTCTTAACTGAGTGTGCTTGGGATCTGCGACGTTTACTTTGAAAAAATTAGAAACCCTAGTAGTCCAGACCAAGTCTCTGCTACCGTAATGTATAGTCA
NCBI: Pocillopora meandrina 18S small subunit ribosomal RNA gene and internal transcribed spacer 1, partial sequence
RDP: Rhizoclosmatium globosum; JEL06; AY988506
HWI-M02808_22_A7NR7_1_1101_18790_12564
TTTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTGACGACATGGTTCTACAGTGCCAGCCGCCGCGGTAAAACAAATGGTGTAAACCTTCTATAGGTTAATTAGATACCCCTGTAGTCCAGACCAAGTCTCTGCTACCGTAATGTATAGTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAATTCT
NCBI: Uncultured prokaryote gene for 16S rRNA, partial sequence, clone: NNCTHY1074
RDP: Coemansia reversa; NRRL1564; AFTOL-ID 140; AY546689
HWI-M02808_22_A7NR7_1_1101_5616_15286
CAGCCGCCGCGGTAAGACTTAAGGATTTATTTTTTTAATAAAAAGCAAAAAGCGTGTTAAGGATTTTTTAAAAAAAAAAATAAATAGAATTTTTTTCGTAATTGTAATATGTTAAAATGAAAAAAAGAATTTTTTATATGAAGATAATTTATTTTTTTTTTCTTAAATACGAAGGTTTGGGGAGCAAATAGGATTAGAAACCCTGGTAGTCCAGACCAAGTCTCTGCTACCGTAATGTATAGTCAT
NCBI: Pocillopora meandrina isolate PMD 12S small subunit ribosomal RNA gene, partial sequence; mitochondrial
RDP: Campylobacter sputorum (T); biovar sputorum;LMG 7795; X67775

Hi all

I have a very similar problem, screen.seqs also removes a big part of my sequences (not most of them but still a lot)
I have the impression that sth is going wrong during the align.seqs but I havent been able to figure out what

I am using the same primers and amplify the same area as CSheridan (515F/806R)

I tried to run align.seqs in different ways: first I didnt run the pcr.seqs command but instead I aligned against the whole silva.bacteria.fasta (the reason I did that was because I wasnt sure at which region should I modify the alignment), after running summary.seqs I got
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 10307 25434 297 0 4 125718
25%-tile: 10364 25434 301 0 4 1257174
Median: 11888 25440 306 0 4 2514347
75%-tile: 11888 25452 310 0 5 3771520
97.5%-tile: 11888 25495 318 0 5 4902976
Maximum: 43116 43116 330 0 46 5028693
Mean: 11310 25456.8 305.437 0 4.43622

of unique seqs: 716721

total # of seqs: 5028693

I saw that there is some variability in the alignment but I just kept going and then I ran the screen seqs
screen.seqs(fasta=Actinotrim.contigs.good.unique.align, count=Actinotrim.contigs.good.count_table, summary=Actinotrim.contigs.good.unique.summary, start=10364, end=25495, maxhomop=8, processors=8)
but after running the screen seqs at the end it tells me that about 100/130samples have been removed cause there are no sequences left!

"…
Removing group: sample92 because all sequences have been removed.

Removing group: sample94 because all sequences have been removed.

Removing group: sample95 because all sequences have been removed.

Removing group: sample96 because all sequences have been removed.

Removing group: sample97 because all sequences have been removed.

Removing group: sample99 because all sequences have been removed.

Output File Names:
shit/Actinotrim.contigs.good.unique.good.summary
shit/Actinotrim.contigs.good.unique.good.align
shit/Actinotrim.contigs.good.unique.bad.accnos
shit/Actinotrim.contigs.good.good.count_table


It took 199 secs to screen 716721 sequences.

mothur >
"

I thought that maybe I should flip the sequences during the align.seqs so I ran again the align.seqs with flip=T, but I got exactly the same results as before

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 10307 25434 297 0 4 125718
25%-tile: 10364 25434 301 0 4 1257174
Median: 11888 25440 306 0 4 2514347
75%-tile: 11888 25452 310 0 5 3771520
97.5%-tile: 11888 25495 318 0 5 4902976
Maximum: 43116 43116 330 0 46 5028693
Mean: 11310.9 25457.7 305.452 0 4.43769

of unique seqs: 716721

total # of seqs: 5028693


After that I ran the screen.seqs command but instead of fixing the position with "start=10364, end=25495" I put "optimize=start-end", that seemed to have fixed things for a while since it didnt remove most of my samples but when I finished all the processing and I wanted to start having fun with the OTUs, the dist.seqs and /or the cluster.slpit were running for an eternity till my hard drive was full (creating temps more than 300GB)
it might be sth pretty stupid that I am doing wrong and I apologize in advance for that but this is my first time analyzing MiSeq data and any help would be greatly appreciated!

Thanks Panos

It looks like there may be something wrong with the data itself. To get a quick idea of what the data looked like when assigned taxonomically I uploaded the various data sets to MG-RAST, and it looks like the majority of the sequences come from eukaryotic origin. So something must have gone wrong with the RT or the PCR that was run on the cDNA. I will attempt to remove the problematic data sets and see if I encounter the same problem again.

I think I figured (at least partially) what was wrong with my data, but still there are several things not working :s
as I wrote in the previous post I wasnt using a custom silva.bacteria alignment but I was using either the whole alignment or one that was extending beyond the v4 region
so when I used the pcr.seqs and cut at the same positions as in the sop forum, there was no removal of most of my sequences

however there are 1-2 things I dont get,
this first one is minor: in the sop tutorial it shows that if you align against the custom silva.v4 alignment then in the summary seqs, it shows sth like this
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1250 10693 250 0 3 1
2.5%-tile: 1968 11550 252 0 3 3227
25%-tile: 1968 11550 252 0 4 32265
Median: 1968 11550 252 0 4 64530
75%-tile: 1968 11550 253 0 5 96794
97.5%-tile: 1968 11550 253 0 6 125832
Maximum: 1982 13400 270 0 12 129058
Mean: 1967.99 11550 252.462 0 4.36663

which makes sense as it is a modified alignment and the regions before and after it have been removed,
however in my printout of the screen seqs I get sth like that:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 11895 25319 292 0 4 125718
25%-tile: 11895 25319 292 0 4 1257174
Median: 11895 25319 293 0 4 2514347
75%-tile: 11895 25319 293 0 5 3771520
97.5%-tile: 11895 25319 293 0 5 4902976
Maximum: 25319 25319 305 0 45 5028693
Mean: 11918.8 25311.9 291.971 0 4.43503

of unique seqs: 716721

total # of seqs: 5028693

the alignment seems to be fine but still, it says that my sequences start getting aligned at position 11895… how is that possible?
then I continue normally with screen.seqs and again after that step I get the same thing:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 11895 25319 268 0 3 1
2.5%-tile: 11895 25319 292 0 4 125220
25%-tile: 11895 25319 292 0 4 1252194
Median: 11895 25319 293 0 4 2504388
75%-tile: 11895 25319 293 0 5 3756581
97.5%-tile: 11895 25319 293 0 5 4883555
Maximum: 11895 25319 305 0 8 5008774
Mean: 11895 25319 292.644 0 4.43981

of unique seqs: 703114

total # of seqs: 5008774



I didnt get much troubled about that, I m not sure if I should? but most importantly now I have a new problem, it seems that I have too many sequences (or too many garbage in my sequences) and therefore when I finish the processing of the sequences and I want to start the actual analysis (OTU making), the dist.seq runs forever, till there is no space in my hard drive (it fills >300gb in 3-4hr) and of course then I cant continue...

my sequences are miseq sequencing of guts from fungus-growing ants, I know that in those ant guts there are 3-5 abundant OTUs (95-98%) and maybe some traces of other ones
To process my sequences I follow the tutorial and I ran initially the make.contigs command
I understand that probably I will have to be more strict with my sequences but not sure at which step should I do that?
I end up after the chimera check to have 131139 sequences in the fasta file
btw I run mothur 1.33.3

and I apologise for posting in this forum, I understand I should move to another one since my problem is not the screen.seqs anymore but since my initial problem was the screen.seqs I thought I should continue posting here


Best Panos

Chris…

I’m a bit concerned that what you are sequencing is primer-dimer. Did you guys run this out on an agilent to see if your fragments are the expected size?

Panos…

Could you start over in a new thread? You’re asking a bunch of questions and I’m unclear how exactly you’re running the commands and how you’re doing your sequencing.

Pat

Hi Pat

thanks for the reply
I have made a new thread here:
http://mothur.ltcmp.net/t/cluster-and-cluster-split-error/1915/1

cheers
Panos

Hi
I am analyzing a subsample of 500 seqs from each of my samples to find the right parameters and learn roughly about the quality of my run, as is the first time I use the SOP with 454 data.
I sequenced the v3-5 region and the barcodes and linker were in my fwd primer.
But, when I aligned the unique seqs to the silva database, it failed and I had to add the flip=T. Then, all the sequences were aligned in the reverse complement form (??)
Then, when I ran screen.seqs, almost all the sequences were removed

Summary after unique.seqs:
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 422 422 0 3 1
2.5%-tile: 1 462 462 0 4 292
25%-tile: 1 480 480 0 4 2913
Median: 1 493 493 0 5 5826
75%-tile: 1 504 504 0 5 8739
97.5%-tile: 1 526 526 0 7 11360
Maximum: 1 567 567 0 8 11651
Mean: 1 492.572 492.572 0 4.90653

of unique seqs: 9737

total # of seqs: 11651

mothur > align.seqs(fasta=454downsample.shhh.trim.unique.fasta, reference=silva.bacteria.fasta, flip=T, processors=2)
mothur > summary.seqs(fasta=454downsample.shhh.trim.unique.align, name=454downsample.shhh.trim.unique.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 1065 9 0 2 1
2.5%-tile: 6428 25513 462 0 4 292
25%-tile: 6428 26387 480 0 4 2913
Median: 6428 26830 493 0 5 5826
75%-tile: 6428 26918 504 0 5 8739
97.5%-tile: 6428 27159 526 0 7 11360
Maximum: 43100 43116 567 0 8 11651
Mean: 6452.79 26683.2 492.19 0 4.90456

of unique seqs: 9737

total # of seqs: 11651

mothur > screen.seqs(fasta=454downsample.shhh.trim.unique.align, name=454downsample.shhh.trim.unique.names, group=454downsample.shhh.groups, end=27159, optimize=start, criteria=95, processors=2)
mothur > summary.seqs(fasta=454downsample.shhh.trim.unique.good.align, name=454downsample.shhh.trim.unique.good.names)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 6426 27159 506 0 4 1
2.5%-tile: 6428 27159 507 0 4 8
25%-tile: 6428 27163 516 0 5 71
Median: 6428 27581 521 0 5 141
75%-tile: 6428 28426 530 0 5 211
97.5%-tile: 6428 28462 545 0 7 273
Maximum: 6428 29665 567 0 7 280
Mean: 6427.99 27677 523.025 0 5.12143

of unique seqs:234

total # of seqs: 280

Any idea what’s happening??

Thanks!

You want start=6428, optimize=end

I have tried that before and didnt work, I was doing 900 min and maxflows also, but then I did as you suggested with 1000, and then tried again as you said here setting start and optimizing end and it worked! may be I did sth wrong before? but the good thing is that it worked! hope no to face further problems and get ready to jump to analysis!
Thank you very much Pat!

Hi all,
i have a similar problem and i have tried everything in this post.

mothur > align.seqs(fasta=NP_197samples.trim.trim.unique.fasta, reference=/store_data/ashraf/outdir/Data/Intensities/BaseCalls/R1/transformed/mothur/silva.bacteria/silva.bacteria.fasta,flip=T)

mothur > summary.seqs()

Start End NBases Ambigs Polymer NumSeqs
Minimum: -1 -1 0 0 1 1
2.5%-tile: 6332 13859 197 0 4 163831
25%-tile: 6333 13862 197 0 4 1638302
Median: 6333 14956 197 0 4 3276603
75%-tile: 6333 14961 197 0 5 4914904
97.5%-tile: 6333 14963 197 0 7 6389375
Maximum: 43116 43116 197 6 124 6553205
Mean: 6486.25 14559.3 195.521 0.00086782 4.63685

of Seqs: 6553205

mothur > screen.seqs(fasta=NP_197samples.trim.trim.unique.align, count=NP_197samples.trim.trim.count_table, summary=NP_197samples.trim.trim.unique.summary, start=6332, end=14963,processors=12)
mothur > summary.seqs(fasta=current,count=current)

Start End NBases Ambigs Polymer NumSeqs
Minimum: 6326 14963 197 0 3 1
2.5%-tile: 6332 14963 197 0 4 896
25%-tile: 6332 14963 197 0 4 8955
Median: 6332 14963 197 0 4 17910
75%-tile: 6332 14963 197 0 5 26864
97.5%-tile: 6332 14965 197 0 6 34923
Maximum: 6332 15649 197 3 18 35818
Mean: 6331.99 14963.2 197 0.000558378 4.52328

of unique seqs: 20020

total # of seqs: 35818

what couse this sequence number reduction from over 6 million reads to 35 K reads? what is that i did wrong?

just foget to mention that the analysis of the previous run is from a 197 samples of different water bodies using PE 300 illumina for v3v4 region (341F,806R). however i only used the R1 read since R2 had a bad quality and lots of N’s, beside i only kept the first 197 bp because the quality after this specific position drooped and contained a lot of N’s.

Ashraf