Unique nseq & a lot of "Bacteria; unlcassified"

Hi everyone.

I’m surely not the first one here who mentions that mothur is awesome and it’s great to work with it. The documentation is perfect, the SOPs are great, thanks for the wiki and having Sarah, Patrick, and many other users here in the forum, giving great support, is beyond everything every user could expect. Thanks!

I’m not sure where to start with my questions. It’s a mix of verifying the choice of my mothur commands and to make sense of some results. Please correct me whenever I’m wrong.

I have data from 26 samples associated with marine invertebrates. Sequenced with MiSeq 2 x 250 paired-end, MS-102-2003, MiSeq Reagent Kit v2 (500 cycle), Cluster density 870K, PhiX 10% spike-in ; removed from the data @ Macrogen. I’ve used the EMP bacteria/archaeal primers 515F/806R. Hence, I think that with this choice there should be no problem with paired-end overlaps?
After reading through a lot of threads here I feel like becoming a hypochondriac concerning my approach and results. So, here are my questions / problems:

1.) I have ~ 10 000 000 total sequences. After aligning I have ~ 1 300 000 unique sequences. The number of unique sequences goes down to ~ 300 000 after pre.cluster. I have no experience so far with MiSeq data - only with 454 - and so many reads. Hence, my first question is: does it look normal? Or is here anything which sets the alarm bells ringing?

2.) I’ve read in the mothur blog entry (http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix%3F/) and here in the forum, that having a huge distance matrix can be caused by not fully overlapping reads. My first distance matrix was huge (I remember that it was something around 100 GB //edit: but I’m not sure anymore - it was large and I generate another one at the moment). Because of this, and a rather small computer (8 cpus, 16 GB memory, 32 GB swap file, 600 GB HD), I went the cluster.split way (taxlevel 4, method furthest, splitmethod classify) instead of dist.seqs & shared. My question: is a ~ 100 GB distance matrix normal with around 300 000 sequences? Is there a usual number of sequences / size relationship? I would expect that the sequence length and overall similarity also accounts for the size of the matrix.

3.) After classify.seqs and remove.lineage (gg or silva) I discovered that around ~ 3 500 000 of my ~ 10 000 000 sequences are unclassified bacteria. Ugh. The rest of the taxonomy looks fine to me. Heaps of Proteobacteria, nothing unexpected or ‘out of the order’. But that large number of unclassified is somehow terrifying to me.
I’ve compared some of the well classified sequences (down to genus or species) with some of the unclassified Bacteria sequences by copy & past some of them randomly into seaview (sequence alignment viewer). Then I de-aligned and locally aligned them and did a Mark I Eyeball test - the as unclassified sequences do not look screwed or very different compared to the well classified. A subsequent blast search showed that the ‘good ones’ have better blast hits than the ‘bad ones’, but are still microbial (with a lower similarity) and nothing totally wrong.
To expand this - after cluster.split, make.shared and classify.otu I’ve got ~ 38 000 (0.03) Otus. The most abundant Otu contains ~ 1 000 000 reads and is one of the unclassified Bacteria. Followed by the 2nd Otu with ~ 800 000 reads, a Proteobacteria, classified down to order. To sum this up: the top 18 Otus, sorted by sequence abundance, contains ~ 6 000 000 reads and five of them are these unclassified bacterial Otus. So, what is my question? Can this be? One or more so abundant Otus, but no certain classification? I’m not experienced enough to understand if this is really a possible outcome or a problem with my pipeline or the sequencing.

Bonus questions:
Is the “Start 1 & End 572” length of the filtered alignment normal for the ~250 bp V4 region?
Is everything fine and are there really so many unclassified bacs in my data?
What can I do to verify that everything is fine or totally wrong?
Can / should I perform the get.oturep (method abundance) command without the distance matrix to see what these Otus are?

I have performed several mothur runs on this data (following the MiSeq SOP, with only slight changes here and there), and I will try to give you a consensus description of my workflow and summaries:

make.contigs(file=cots.files, processors=8)
summary.seqs(fasta=cots.trim.contigs.fasta, processors=8)

######### excurse to trim primers #############
#http://mothur.org/forum/viewtopic.php?p=8551 - http://w.mothur.org/forum/viewtopic.php?f=3&t=3047&p=8551&sid=94438e15ff6c9c1a0502ed7010a50ed0
####################################
trim.seqs(fasta=cots.trim.contigs.fasta, oligos=cots.oligo, pdiffs=2, maxambig=0, minlength=200, maxlength=300, qwindowaverage=35, qwindowsize=50, processors=8) # won't create a group since you don't have any names

# list trimmed sequences 
list.seqs(fasta=cots.trim.contigs.trim.fasta)

# will make sure any sequences that were removed due to the primer are removed from the group file.
get.seqs(group=cots.contigs.groups, accnos=cots.trim.contigs.trim.accnos)
#####################################

######### from here the usual way #################
screen.seqs(fasta=cots.trim.contigs.trim.fasta, group=cots.contigs.pick.groups, maxambig=0, maxlength=275, processors=8)
summary.seqs(fasta=current)

unique.seqs(fasta=cots.trim.contigs.trim.good.fasta)
count.seqs(name=cots.trim.contigs.trim.good.names, group=cots.contigs.pick.good.groups, processors=8)
summary.seqs(count=cots.trim.contigs.trim.good.count_table, processors=8)

### create personal alignment file ###
pcr.seqs(fasta=silva.seed_v119.align, start=11894, end=25319, keepdots=F, processors=8) #nice idea
system(mv silva.seed_v119.pcr.align silva.align.fasta)
summary.seqs(fasta=silva.align.fasta, processors=8)
######################################

# flip=t
align.seqs(fasta=cots.trim.contigs.trim.good.unique.fasta, reference=silva.align.fasta, flip=T, processors=8)
summary.seqs(fasta=cots.trim.contigs.trim.good.unique.align, count=cots.trim.contigs.trim.good.count_table, processors=8)

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 10179 200 0 3 1
2.5%-tile: 1968 11550 252 0 4 258852
25%-tile: 1968 11550 253 0 4 2588518
Median:  1968 11550 253 0 4 5177036
75%-tile: 1968 11550 253 0 5 7765553
97.5%-tile: 1968 11550 256 0 6 10095219
Maximum: 3796 13406 275 0 19 10354070
Mean: 1968.06 11550 252.978 0 4.54403
# of unique seqs: 1302064
total # of seqs: 10354070

screen.seqs(fasta=cots.trim.contigs.trim.good.unique.align, count=cots.trim.contigs.trim.good.count_table, summary=cots.trim.contigs.trim.good.unique.summary, start=1968, end=11550, minlength=250, maxlength=260, maxhomop=8, processors=8) 
summary.seqs(fasta=current, count=current, processors=8) 

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1961 11550 250 0 3 1
2.5%-tile: 1968 11550 252 0 4 258134
25%-tile: 1968 11550 253 0 4 2581336
Median:  1968 11550 253 0 4 5162671
75%-tile: 1968 11550 253 0 5 7744006
97.5%-tile: 1968 11550 256 0 6 10067207
Maximum: 1968 12069 260 0 8 10325340
Mean: 1968 11550 252.985 0 4.5437
# of unique seqs: 1289638
total # of seqs: 10325340

filter.seqs(fasta=cots.trim.contigs.trim.good.unique.good.align, vertical=T, trump=., processors=8)

Length of filtered alignment: 572
Number of columns removed: 12853
Length of the original alignment: 13425
Number of sequences used to construct filter: 1289638

unique.seqs(fasta=cots.trim.contigs.trim.good.unique.good.filter.fasta, count=cots.trim.contigs.trim.good.good.count_table)
summary.seqs(fasta=current, count=current, processors=8)

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 572 250 0 3 1
2.5%-tile: 1 572 252 0 4 258134
25%-tile: 1 572 253 0 4 2581336
Median:  1 572 253 0 4 5162671
75%-tile: 1 572 253 0 5 7744006
97.5%-tile: 1 572 256 0 6 10067207
Maximum: 2 572 260 0 8 10325340
Mean: 1 572 252.985 0 4.5437
# of unique seqs: 1289579
total # of seqs: 10325340

pre.cluster(fasta=cots.trim.contigs.trim.good.unique.good.filter.unique.fasta, count=cots.trim.contigs.trim.good.unique.good.filter.count_table, diffs=2, processors=8)

# I've forgot the summary here and the precluster fasta / count_table files are on my lab computer - but it goes down from ~ 1 300 000 to ~ 310 000 (see below, remove.seqs does not removes so much)

chimera.uchime(fasta=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.fasta, count=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.count_table, dereplicate=t, processors=8)

remove.seqs(fasta=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.fasta, accnos=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.uchime.accnos)
summary.seqs(fasta=current, count=current, processors=8)

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 572 250 0 3 1
2.5%-tile: 1 572 252 0 4 250192
25%-tile: 1 572 253 0 4 2501919
Median:  1 572 253 0 4 5003837
75%-tile: 1 572 253 0 5 7505755
97.5%-tile: 1 572 255 0 6 9757482
Maximum: 2 572 260 0 8 10007673
Mean: 1 572 252.982 0 4.53237
# of unique seqs: 304376
total # of seqs: 10007673

#classify.seqs(fasta=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.fasta, count=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=gg_13_8_99.fasta, taxonomy=gg_13_8_99.gg.tax, cutoff=60, processors=8)

classify.seqs(fasta=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.fasta, count=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=silva.nr_v119.align, taxonomy=silva.nr_v119.tax, cutoff=60, processors=8)

remove.lineage(fasta=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.fasta, count=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.nr_v119.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)
summary.seqs(fasta=current, count=current, processors=8)

  Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 572 250 0 3 1
2.5%-tile: 1 572 252 0 4 249380
25%-tile: 1 572 253 0 4 2493791
Median:  1 572 253 0 4 4987581
75%-tile: 1 572 253 0 5 7481371
97.5%-tile: 1 572 255 0 6 9725782
Maximum: 2 572 260 0 8 9975160
Mean: 1 572 252.982 0 4.53356
# of unique seqs: 303354
total # of seqs: 9975160

######### file name simplification ########

system(cp cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.pick.fasta cots.final.fasta)
system(cp cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table cots.final.count_table)
system(cp cots.trim.contigs.trim.good.unique.good.filter.unique.precluster.pick.nr_v119.wang.pick.taxonomy cots.final.taxonomy)

######### from here Otu stuff ##########

## cluster split 

cluster.split(fasta=cots.final.fasta, count=cots.final.count_table, taxonomy=cots.final.taxonomy, splitmethod=classify, taxlevel=4, cutoff=0.03, hard=t, method=furthest, processors=8)
make.shared(list=cots.final.fn.unique_list.list, count=cots.final.count_table, label=0.03)
classify.otu(list=cots.final.fn.unique_list.list, count=cots.final.count_table, taxonomy=cots.final.taxonomy, label=0.03)

Thank you in advance!

You have a lot of questions! Usually it’s best to break them up into separate threads, but let me try here…

1.) I have ~ 10 000 000 total sequences. After aligning I have ~ 1 300 000 unique sequences. The number of unique sequences goes down to ~ 300 000 after pre.cluster. I have no experience so far with MiSeq data - only with 454 - and so many reads. Hence, my first question is: does it look normal? Or is here anything which sets the alarm bells ringing?

That does seem high. Also, looking at your mothur commands you do this:

trim.seqs(fasta=cots.trim.contigs.fasta, oligos=cots.oligo, pdiffs=2, maxambig=0, minlength=200, maxlength=300, qwindowaverage=35, qwindowsize=50, processors=8) # won't create a group since you don't have any names

Which doesn’t make sense since the quality scores we now output won’t be compatible with trim.seqs - you also don’t provide the actual quality file so it likely isn’t doing anything. Why not give your oligos file in make.contigs? What type of environment are you sampling from?

3.) After classify.seqs and remove.lineage (gg or silva) I discovered that around ~ 3 500 000 of my ~ 10 000 000 sequences are unclassified bacteria. Ugh. The rest of the taxonomy looks fine to me. Heaps of Proteobacteria, nothing unexpected or ‘out of the order’. But that large number of unclassified is somehow terrifying to me.

you mean they’re “Bacteria;unclassified”?


If you could split up the questions and give more information that would be great.

Pat