Working out proceedure for nanopore 16s

Creating this thread for my questions that arise when working through nanopore data because that could be useful for others. I’ll write up an SOP if the end result is worth others repeating.

Background. I’ve finally gotten my gridion and started sequencing. I have a few zymo mocks sequenced with nanopore’s 16s kit on R9.? (too high error to use mothur), 8f/1391R with nanopore PCR barcoded on R10.4, plus I’ve run them on MiSeq using standard v4 sequencing with EMP version of 515f/806R.

I’m trying to get the R10.4 data through mothur but I’m not super confident it’s going to give me anything useful. There are very few non-unique sequences which suggests too high error rates. This is still running (4 days and counting on 32 cores 500gb ram).

I’ve also pcr.seq selected out just v4 from this data and am trying to run mothur on that section to get a better idea of how this compares to illumina.

Finally, I’m going to repeat both these analyses on just the duplex data which is ~10% of the R10.4 run.

I’m presenting all this at ASM Microbe next month, so I have a deadline to get this done!

First few issues:
trimming data: I’m using PCR seqs with pdiffs/rdiffs = 3 and 5

Here’s the batch that I’m trying to run. I’m sure that I’ll have to edit this as I find more errors. I’m on a high mem node, using 500gb

mothur "#summary.seqs(fasta=oct23nanoporeRecall.fasta, processors=32);
pcr.seqs(fasta=current, count=oct23nanoporeRecall.count_table, oligos=oligos8f1391r.txt, pdiffs=3, rdiffs=3, checkorient=T);
summary.seqs(fasta=current, count=current);
screen.seqs(fasta=current, maxambig=0, minlength=800, maxhomop=8);
unique.seqs(fasta=current);
summary.seqs(fasta=current, count=current);
align.seqs(fasta=current, reference=silva.nr_v138_1.align);
summary.seqs(fasta=current, count=current);
filter.seqs(fasta=current, vertical=T);
pre.cluster(fasta=current, count=current, diffs=14);
summary.seqs(fasta=current, count=current);
classify.seqs(fasta=current, count=current, reference=silva.nr_v138_1.align, taxonomy=silva.nr_v138_1.tax, cutoff=80);
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota);
summary.tax(taxonomy=current, count=current);
dist.seqs(fasta=current, countends=F, cutoff= 0.03, processors=16);
cluster(column=current, count=current, method=opti);
summary.seqs(processors=32);
make.shared(list=current, count=current);
classify.otu(list=current, count=current, taxonomy=current);
get.oturep(fasta=current, count=current, list=current, method=abundance);
count.groups(shared=current);
rarefaction.single(shared=current);
summary.single(shared=current, calc=nseqs-sobs-coverage-shannon-shannoneven-invsimpson, subsample=10000);

Thanks for doing this Kendra. Since you’re at a core facility that has done both the Kozich protocol and nanopore sequencing, you’ll likely be able to answer this. Starting with genomic DNA, what do you estimate the cost to be to generate sequence data and how many reads of sequence data are being generated?

We were able to get very good PacBio full length sequence data that was as good as our Kozich protocol, but it would have been prohibitively expensive to do at the same scale. In my ranking quality is more important than cost, but cost is still important.

Thanks,
Pat

Quick question: how long does it take and on what type of computer? Can you also tell us how many samples you run per flow cell?

Thanks!

Cost is so dependent on throughput. If I can get enough data for 48 samples/flowcell, the cost will be very close to miseq which is my goal. I’m defining “enough data” as being at least a few thousand seqs as high quality as illumina since that’s what people expect. If I can’t get that, I’m going to wait till the next gen of flowcells/basecallers before launching the service.

We (royal we, the other UConn sequencing facility) now had a Sequal II that is very underutilized because they have a Revio. If nanopore really isn’t ready for amplicons, I’ll probably turn protocol dev towards pacbio. I just fundamentally want to get long reads to work on a $ sequencer rather than a $$$$$ sequencer.

As you know, one of the major hurdles in getting “enough” data for a bunch of samples is loading ~equal DNA for each barcode. I had an undergrad this past year working out a spri bead clean protocol for that. We make the PEG solution and dilute commercial beads with that to limit the amount of DNA bound. I have 96 nanopore barcodes ready to test with this method. hopefully this week.

1 Like

@Alexandre_Thibodeau I’ve been playing with a test run that is 24 samples. I got ~8Gb but throw out 75% when matching primers. Those ~2.2 M reads never finished running using 32 cores and 500Gb ram. Dist.seqs (processors=2) ran for 2 weeks and I gave up (aka had other projects that I needed to get on that computer). Interestingly, I also trimmed out just v4 and it also ran for 2 weeks on dist.seqs and didn’t finish. I was able to run both the full length and trimmed v4 through dada2 but the results were nonsense.

I’ve pulled out just the duplex reads from this run ~0.5M reads. they ran through mothur overnight using 500gb ram. That just finished, I’ll look at the results today and see if they look more reasonable.

well boo. the duplex reads aren’t much better. I think this is going to require a deeper dive into the gnitty gritty of nanopore sequencing and basecalling.

If anyone’s at ASM, I’m presenting this tomorrow MBP-SATURDAY-651

Here’s my poster if you’re interested.

A number of ONT FAS came to talk to me, all insisting that there was a problem with my bioinformatics that was making the data look bad. I’m going to talk to some of them later this to see what the computational FAS think. They do have a new basecaller in the past month, so I’m trying to get that rerun.Though honestly, I’m going to be worried if it changes the calls so much that it drops the error rate significantly.

@pschloss I ran seq.error but don’t know how to use the output beyond that one high level overall rate. Do you have scripts for what to do with it? Here’s the report for in silico v4 (pulled from the full sequences with pcr.seqs)

Errors Sequences
0 787266
1 3872
2 568
3 47
4 14
5 8
6 4
7 1

And here’s the summary for the full length

Errors Sequences
0 20336
1 4031
2 459
3 18
4 16
5 31
6 0
7 0
8 4
9 1
10 0
11 0
12 0
13 3
14 0
15 0
16 0
17 0
18 0
19 0
20 1
21 0
22 0
23 1

@graciella here’s what I’m trying