Hi,
I am not sure whether this has been asked before, but when I run align.seqs, many sequences are removed from the fasta file, which creates a discrepancy between the fasta file and the name/group files. For instance, if I run uchime downstream of this, the error
“[ERROR]: sequenceA is in your name file and not in your fasta file, please correct.”
comes up multiple times.
NB: these are not amplicons. I just picked out the 16S rRNA gene sequences and would like to classify them.
How can we avoid this, or is there a way around it?
I have posted the output below:
mothur > summary.seqs(fasta=current, name=current)
Using 16S_pandaseq.unique.fasta as input file for the fasta parameter.
Using 16S_pandaseq.names as input file for the name parameter.
Using 1 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 47 47 0 1 1
2.5%-tile: 1 88 88 0 3 707378
25%-tile: 1 111 111 0 4 7073777
Median: 1 138 138 0 4 14147554
75%-tile: 1 155 155 0 5 21221331
97.5%-tile: 1 168 168 0 6 27587730
Maximum: 1 170 170 3 86 28295107
Mean: 1 133.095 133.095 0.000118854 4.14909
# of unique seqs: 10402292
total # of seqs: 28295107
Output File Names:
16S_pandaseq.unique.summary
It took 253 secs to summarize 28295107 sequences.
mothur > align.seqs(fasta=current, reference=/mothurfiles/silva.nr_v123.align, flip=T, processors=8)
Using 16S_pandaseq.unique.fasta as input file for the fasta parameter.
Using 8 processors.
Reading in the /mothurfiles/silva.nr_v123.align template sequences… DONE.
It took 264 to read 172418 sequences.
Aligning sequences from 16S_pandaseq.unique.fasta …
Some of you sequences generated alignments that eliminated too many bases, a list is provided in 16S_pandaseq.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 45468 secs to align 10402292 sequences.
Output File Names:
16S_pandaseq.unique.align
16S_pandaseq.unique.align.report
16S_pandaseq.unique.flip.accnos
mothur > summary.seqs(fasta=current, name=current)
Using 16S_pandaseq.unique.align as input file for the fasta parameter.
Using 16S_pandaseq.names as input file for the name parameter.
Using 8 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: -1 -1 0 0 1 1
2.5%-tile: 1046 1460 10 0 2 476999
25%-tile: 9815 14974 100 0 3 4769984
Median: 21926 26171 127 0 4 9539968
75%-tile: 34129 38347 152 0 5 14309952
97.5%-tile: 43021 43116 168 0 6 18602937
Maximum: 43116 43116 170 2 78 19079935
Mean: 21458.5 25246.9 121.296 8.24426e-05 4.06886
# of unique seqs: 5161509
total # of seqs: 19079935
Output File Names:
16S_pandaseq.unique.summary
It took 3430 secs to summarize 19079935 sequences.
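In case it helps to pin down the discrepancy, here is a minimal Python sketch of the kind of check I have in mind: it collects the sequence names present in a fasta file and splits the rows of a name file into those whose representative name is still in the fasta and those that are not. It assumes the usual two-column, tab-separated name file (representative name, then a comma-separated list of duplicates). The function names and the toy data are hypothetical and only for illustration, and mothur's own list.seqs and get.seqs commands may well be the more appropriate way to do the same thing.

```python
import io


def fasta_ids(handle):
    """Return the set of sequence names found on the '>' header lines."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()  # name is the text up to the first whitespace
            if fields:
                ids.add(fields[0])
    return ids


def filter_names(names_handle, keep_ids):
    """Split name-file rows by whether the representative name is in keep_ids.

    Assumes each row is: representative<TAB>comma-separated duplicate names.
    """
    kept, dropped = [], []
    for line in names_handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        representative = line.split("\t", 1)[0]
        if representative in keep_ids:
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped


if __name__ == "__main__":
    # Toy stand-ins for the real files (hypothetical sequence names).
    fasta = io.StringIO(">seqA\nACGT\n>seqC\nGGCC\n")
    names = io.StringIO("seqA\tseqA,seqB\nseqC\tseqC\nseqD\tseqD,seqE\n")
    kept, dropped = filter_names(names, fasta_ids(fasta))
    print("rows kept:", kept)
    print("rows dropped:", dropped)
```

With the real files, the two StringIO objects would be replaced by open() calls on the aligned fasta and the name file, and the kept rows written out as a new name file.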
Thanks,
Sou