Creating a database

Hello all, I kinda have a lot of different 16S runs from chickens, swine, a slaughterhouse, in vitro studies, etc.

I just want a way to put all of that together and query the resulting database, so I can make full use of all the money that was spent.

Here is what I have thought of.

Independently, run mothur on each experiment, stopping just before making OTUs. The idea is that when a new experiment is added to the database, mothur does not need to be rerun from scratch on everything.

Then, put all “cleaned” experiments together using the merge command, or add incoming new experiments to the “core file”.

Run unique on the core file.

Make OTUs from the core file.

Classify said OTUs from the core file.

Use the final classification and shared file for analysis, stored in a database from which I can pull data according to certain metadata and run whatever I want from there: simply compare a new experiment with existing data, build on a certain research topic, do machine learning for diagnostic purposes, etc.

Do you believe this is possible/useful?
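To make the "pull out data according to certain metadata" step concrete, here is a minimal sketch in Python. It assumes a mothur-style shared file (tab-delimited, with `label`, `Group`, `numOtus`, then one column per OTU) and a separate metadata table; the sample names, the metadata columns, and the `select_samples` helper are all made up for illustration.

```python
import csv
import io

# Hypothetical shared file: label, Group, numOtus, then one column per OTU.
SHARED = (
    "label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003\n"
    "0.03\tchick_01\t3\t120\t30\t0\n"
    "0.03\tswine_01\t3\t10\t200\t5\n"
    "0.03\tchick_02\t3\t90\t45\t2\n"
)

# Hypothetical metadata table keyed by the same sample names.
METADATA = (
    "sample\thost\texperiment\n"
    "chick_01\tchicken\texp_A\n"
    "swine_01\tswine\texp_B\n"
    "chick_02\tchicken\texp_C\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def select_samples(shared_rows, metadata_rows, **criteria):
    """Return shared-file rows whose metadata match every key=value criterion."""
    wanted = {
        row["sample"]
        for row in metadata_rows
        if all(row.get(key) == value for key, value in criteria.items())
    }
    return [row for row in shared_rows if row["Group"] in wanted]


shared_rows = read_tsv(SHARED)
metadata_rows = read_tsv(METADATA)

# Pull every chicken sample, regardless of which experiment it came from.
chicken_rows = select_samples(shared_rows, metadata_rows, host="chicken")
for row in chicken_rows:
    print(row["Group"], row["Otu001"], row["Otu002"], row["Otu003"])
```

In practice the two tables would be read from files, or from a real database, rather than from inline strings.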


Yes, I believe this is possible and useful. Keep core files (fasta, name, group) and add new experiments (fasta, name, group) using merge.files.

Things to think about: the subsampled size of the core (it needs to be lower than the size of any new experiment added), and the variation that different methods may cause (combining experiments with different DNA extraction methods, different PCR primers…). We did something similar, adding data from the literature.
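The subsampling point can be sketched as follows. This is only an illustration in Python, not what mothur does internally: the counts, sample names, and helper functions are invented, and the depth is simply taken as the smallest library size across the core and the incoming experiment.

```python
import random


def library_sizes(counts_by_sample):
    """Total number of reads per sample."""
    return {sample: sum(otus.values()) for sample, otus in counts_by_sample.items()}


def subsample(otu_counts, depth, seed=0):
    """Draw `depth` reads without replacement from one sample's OTU counts."""
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    drawn = random.Random(seed).sample(pool, depth)
    result = {otu: 0 for otu in otu_counts}
    for otu in drawn:
        result[otu] += 1
    return result


# Hypothetical core database and one incoming experiment.
core = {
    "chick_01": {"Otu001": 120, "Otu002": 30},
    "swine_01": {"Otu001": 10, "Otu002": 200},
}
new = {"chick_09": {"Otu001": 60, "Otu002": 40}}

all_samples = {**core, **new}

# The shared depth cannot exceed the smallest sample, old or new.
depth = min(library_sizes(all_samples).values())
rarefied = {sample: subsample(otus, depth) for sample, otus in all_samples.items()}
print(depth, library_sizes(rarefied))
```

If a later experiment arrives with a smaller library than the current depth, the core has to be subsampled again at the new, lower depth, which is why the depth is worth choosing conservatively from the start.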


Another issue with pooling different datasets prior to clustering is that if you run filter.seqs separately, then you will have different alignments for each dataset. That will make it difficult to calculate distances between sequences from different datasets.
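A toy illustration of the filter.seqs problem, assuming the filtering in question removes columns that are gaps in every sequence: each dataset loses a different set of columns, so column *i* in one filtered alignment no longer corresponds to column *i* in the other. The sequences and the `drop_all_gap_columns` helper are invented for this sketch and are not mothur's implementation.

```python
def drop_all_gap_columns(aligned):
    """Remove every column that is a gap ('-' or '.') in all sequences.

    Returns the filtered alignment and the original indices of the kept columns.
    """
    length = len(next(iter(aligned.values())))
    keep = [
        i for i in range(length)
        if any(seq[i] not in "-." for seq in aligned.values())
    ]
    filtered = {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}
    return filtered, keep


# Two hypothetical experiments aligned against the same reference (7 columns).
exp_a = {"a1": "AC-G--T", "a2": "AC-G--A"}
exp_b = {"b1": "A--GC-T", "b2": "A--GC-T"}

filtered_a, keep_a = drop_all_gap_columns(exp_a)
filtered_b, keep_b = drop_all_gap_columns(exp_b)

# Filtering separately: both alignments are 4 columns long,
# but they retain different reference positions.
print(keep_a, filtered_a)
print(keep_b, filtered_b)

# Pooling first and filtering once keeps a single shared coordinate system.
pooled, keep_pooled = drop_all_gap_columns({**exp_a, **exp_b})
print(keep_pooled, pooled)
```

Here the separately filtered alignments happen to have the same length, which makes the mismatch easy to miss: the second column is reference position 1 in one dataset and position 3 in the other.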

Be on the lookout for a preprint from us describing a new function, which will make more robust OTU assignments for an open reference clustering approach. Hopefully, this will be available in the next month.



Thanks sje062 and Dr Schloss.

My idea is simply to reduce computer use, but I do get your point. So I could do everything separately up to the alignment with SILVA, then pool the different experiments together, run unique, run align.seqs, and so on.

That will be heavy for the computer. Fortunately, most of my database will be swine and chicken, so at some point the number of unique sequences found should stabilize and therefore limit the size of the alignment and the following distance matrix.
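As a rough way to check whether the number of unique sequences is stabilizing, one could track the cumulative count as experiments are pooled. The sketch below uses invented sequences and run names; the pair count is only an upper bound on the number of pairwise distances, since a distance cutoff would normally reduce what is actually stored.

```python
def accumulate_uniques(experiments):
    """Track cumulative unique sequences as experiments are added in order.

    Returns (experiment name, unique count, upper bound on pairwise distances).
    """
    seen = set()
    growth = []
    for name, sequences in experiments:
        seen.update(sequences)
        n = len(seen)
        growth.append((name, n, n * (n - 1) // 2))
    return growth


# Hypothetical experiments, listed in the order they join the database.
experiments = [
    ("swine_run1", ["ACGT", "ACGA", "ACGT"]),
    ("chicken_run1", ["ACGA", "TTGA", "ACGT"]),
    ("swine_run2", ["ACGT", "TTGA"]),
]

growth = accumulate_uniques(experiments)
for name, n_unique, n_pairs in growth:
    print(f"{name}: {n_unique} unique sequences, at most {n_pairs} pairwise distances")
```

Because the pair count grows with the square of the number of uniques, even a slow increase in unique sequences matters for the size of the distance matrix.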

Can’t wait to try the new function; I hope it will solve some issues I had with my disappearing Salmonella.

And by the way, thanks for the Riffomonas videos. I am currently watching the ones about mikropml, which gave me the idea and, more importantly, the will to spend the time putting together the aforementioned database.


