Blast against PATRIC Pathogen Database?

Hi,

I wonder if we could further classify unique 16S sequences to detect bacterial species or OTUs that are potentially pathogenic in the human gut (PMID: 21896772)?

I notice that a similar approach has already been demonstrated through the HMP (PMID: 22699609), and I would like to know whether it is possible in mothur.

Thank you.

Daniel

Absolutely - you would have to come up with the database, but you can certainly do it. I’d encourage you to take a look at our RDP and SILVA reference taxonomies and use them as a basis for yours. If you have questions or problems as you go along making this, let us know.
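One thing worth checking on any custom reference is that the FASTA file and the taxonomy file agree on sequence names. Here is a minimal Python sketch of that check, assuming the taxonomy file uses the name, tab, semicolon-separated lineage layout of the RDP and SILVA reference taxonomies. The function names and the two-sequence example data are made up for illustration.

```python
def fasta_ids(fasta_text):
    """Return the sequence names (first token after '>') found in FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def taxonomy_map(tax_text):
    """Parse 'name<TAB>taxon;taxon;...;' lines into {name: [taxa]}."""
    taxa = {}
    for line in tax_text.splitlines():
        if not line.strip():
            continue
        name, lineage = line.split("\t", 1)
        taxa[name] = [t for t in lineage.strip().split(";") if t]
    return taxa


def check_reference(fasta_text, tax_text):
    """Return (names lacking a taxonomy line, names lacking a sequence)."""
    ids = set(fasta_ids(fasta_text))
    named = set(taxonomy_map(tax_text))
    return sorted(ids - named), sorted(named - ids)


# Hypothetical two-sequence reference; seqB has no taxonomy entry.
example_fasta = ">seqA\nACGT\n>seqB\nGGCC\n"
example_tax = "seqA\tBacteria;Proteobacteria;Gammaproteobacteria;\n"
missing_tax, missing_seq = check_reference(example_fasta, example_tax)
print(missing_tax, missing_seq)
```

Both lists should come back empty before the pair of files is used as a reference.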

Hi Pat,

Thanks for the advice! Following it as best I could, I ran the mothur commands and generated the following summary.seqs outputs:

“patric.bacteria.ssu.fasta” (raw extracted small subunit ribosomal sequences of variable length)
            Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:    1        61       61       0          2        1
2.5%-tile:  1        96       96       0          3        688
25%-tile:   1        355      355      0          5        6872
Median:     1        1438     1438     0          5        13744
75%-tile:   1        1528     1528     0          6        20616
97.5%-tile: 1        1564     1564     0          7        26800
Maximum:    1        3038     3038     34         29       27487
Mean:       1        1018.75  1018.75  0.0294321  5.15946

# of Seqs:  27487

“patric.bacteria.ssu.align” (align.seqs using silva.bacteria.fasta)
            Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:    0        0        0        0          1        1
2.5%-tile:  1044     1815     68       0          3        688
25%-tile:   1044     15966    329      0          5        6872
Median:     1044     43116    1399     0          5        13744
75%-tile:   1053     43116    1463     0          6        20616
97.5%-tile: 40960    43116    1483     0          7        26800
Maximum:    43116    43116    1612     34         29       27487
Mean:       5065.95  32624.9  969.006  0.0274675  5.12144

# of Seqs:  27487

“patric.bacteria.ssu.good.align”
            Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:    1044     43116    1375     0          4        1
2.5%-tile:  1044     43116    1407     0          5        283
25%-tile:   1044     43116    1450     0          5        2826
Median:     1044     43116    1464     0          6        5651
75%-tile:   1044     43116    1469     0          6        8476
97.5%-tile: 1044     43116    1486     0          7        11019
Maximum:    1044     43116    1612     0          8        11301
Mean:       1044     43116    1457.76  0          5.57694

# of Seqs:  11301

“patric.bacteria.ssu.good.unique.align”
            Start    End      NBases   Ambigs     Polymer  NumSeqs
Minimum:    1044     43116    1375     0          4        1
2.5%-tile:  1044     43116    1407     0          5        270
25%-tile:   1044     43116    1450     0          5        2699
Median:     1044     43116    1464     0          6        5398
75%-tile:   1044     43116    1468     0          6        8096
97.5%-tile: 1044     43116    1486     0          7        10525
Maximum:    1044     43116    1612     0          8        10794
Mean:       1044     43116    1457.61  0          5.58236

# of Seqs:  10794
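To put the four summaries side by side, here is a small Python sketch that tallies how many sequences survive each step. The counts are copied from the outputs above; the function name and row layout are only for illustration.

```python
# Sequence counts taken from the summary outputs above.
counts = [
    ("patric.bacteria.ssu.fasta", 27487),
    ("patric.bacteria.ssu.good.align", 11301),
    ("patric.bacteria.ssu.good.unique.align", 10794),
]


def retention(counts):
    """Return (file, kept, lost_since_previous_step, percent_of_raw) rows."""
    total = counts[0][1]
    previous = total
    rows = []
    for name, kept in counts:
        rows.append((name, kept, previous - kept, round(100 * kept / total, 1)))
        previous = kept
    return rows


for name, kept, lost, percent in retention(counts):
    print(f"{name}: kept {kept}, lost {lost}, {percent}% of raw")
```

Screening removes 16186 of the 27487 raw sequences (about 41% remain), and dereplication removes a further 507.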

Finally, I ran classify.seqs (template=silva.bacteria.fasta, taxonomy=silva.bacteria.silva.tax, probs=F, cutoff=80) to generate the wang.taxonomy file that will serve as the PATRIC reference taxonomy.
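For anyone following along, this is roughly what I understand cutoff=80 with probs=F to mean: ranks are kept until the first one whose bootstrap confidence falls below 80, and the confidence values are then dropped from the output. The Python sketch below only illustrates that idea on a made-up lineage string. It is not mothur's code, and mothur labels the truncated ranks itself in a way that depends on the version.

```python
import re


def apply_cutoff(lineage, cutoff=80):
    """Keep ranks until the first one whose confidence is below cutoff.

    Expects 'Taxon(conf);Taxon(conf);...;' and returns 'Taxon;Taxon;...;'
    with the confidence values stripped, or '' if no rank passes.
    """
    kept = []
    for taxon in filter(None, lineage.strip().split(";")):
        match = re.fullmatch(r"(.+)\((\d+)\)", taxon)
        if match is None or int(match.group(2)) < cutoff:
            break
        kept.append(match.group(1))
    return ";".join(kept) + ";" if kept else ""


# Hypothetical lineage with per-rank confidence values, for illustration only.
example = "Bacteria(100);Proteobacteria(97);Gammaproteobacteria(85);Enterobacteriales(62);Escherichia(40);"
print(apply_cutoff(example))
```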

Judging from the sequence statistics above, do you think this approach is sound enough to make a quality database?

Thank you again!

Daniel

Looks good. If I were you, I would want to know which sequences got chucked for being too short or not aligning well. You might look for full-length versions of those sequences elsewhere and then bring them into the database.
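One way to get that list is to compare sequence names between the raw FASTA file and the screened alignment. Here is a minimal Python sketch that assumes both files fit in memory and that names are the first token on each header line. The function names and the three-sequence example are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">") and len(line) > 1:
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def dropped_sequences(raw_text, kept_text):
    """Return (name, ungapped_length) for raw sequences absent from kept, shortest first."""
    raw = read_fasta(raw_text)
    kept = read_fasta(kept_text)
    gaps = str.maketrans("", "", ".-")  # strip alignment gap characters
    dropped = [(n, len(s.translate(gaps))) for n, s in raw.items() if n not in kept]
    return sorted(dropped, key=lambda pair: pair[1])


# Hypothetical data: s2 and s3 did not survive screening.
raw_example = ">s1\nACGTACGT\n>s2\nACG\n>s3 partial\nACGTA\n"
kept_example = ">s1\n..AC--GTACGT..\n"
print(dropped_sequences(raw_example, kept_example))
```

Sorting by length separates the fragments, which need a full-length replacement, from the long sequences that were dropped because they aligned poorly.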

Pat