Creating a database

Alexandre_Thibodeau · July 22, 2021, 3:03pm

Hello all, I have a lot of different 16S runs from chickens, swine, slaughterhouses, in vitro studies, etc.

I just want a way to put all of that together and query the resulting database, so I can make full use of all that spent money.

Here is what I have thought of.

Run Mothur independently on each experiment, stopping just before the OTU-making step. The idea is that when a new experiment is added to the database, Mothur does not need to be run from scratch on everything.

Then put all the “cleaned” experiments together using the merge command, or add incoming new experiments to the “core file”.

Run unique.seqs on the core file.

Make OTUs from the core file.

Classify those OTUs from the core file.

Use the final classification and shared files for analysis, stored in a database from which I can pull out data according to certain metadata and run whatever I want from there: simply compare a new experiment with existing data, build on a certain research topic, do machine learning for diagnostic purposes, etc.
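As a rough sketch of the "pull out data according to certain metadata" step, the Python below assumes a mothur shared file (tab-delimited, with the columns label, Group, numOtus, then one column per OTU) and a separate metadata table keyed by sample name. The metadata layout, the sample names, the `host` column, and the helper `select_samples` are all invented for illustration; they are not part of mothur.

```python
import csv
import io

# Toy stand-in for a mothur shared file: label, Group, numOtus, then OTU counts.
SHARED_TEXT = (
    "label\tGroup\tnumOtus\tOtu001\tOtu002\tOtu003\n"
    "0.03\tchick_A1\t3\t120\t5\t0\n"
    "0.03\tswine_B1\t3\t10\t80\t33\n"
    "0.03\tchick_A2\t3\t95\t0\t12\n"
)

# Hypothetical metadata table (not a mothur format): one row per sample.
METADATA_TEXT = (
    "sample\thost\texperiment\n"
    "chick_A1\tchicken\texp01\n"
    "swine_B1\tswine\texp02\n"
    "chick_A2\tchicken\texp03\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def select_samples(shared_rows, metadata_rows, **criteria):
    """Return {sample: {otu: count}} for samples whose metadata match criteria."""
    wanted = {
        row["sample"]
        for row in metadata_rows
        if all(row.get(key) == value for key, value in criteria.items())
    }
    selected = {}
    for row in shared_rows:
        if row["Group"] in wanted:
            selected[row["Group"]] = {
                name: int(count)
                for name, count in row.items()
                if name not in ("label", "Group", "numOtus")
            }
    return selected


chicken = select_samples(read_tsv(SHARED_TEXT), read_tsv(METADATA_TEXT), host="chicken")
for sample, counts in sorted(chicken.items()):
    print(sample, counts)
```

The same call with `host="swine"` or `experiment="exp02"` would pull a different slice, which could then be handed to whatever comparison or machine learning step comes next.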

Do you believe this is possible/useful?

sje062 · July 23, 2021, 12:22pm

Hi,

Yes, I believe this is possible and useful. Keep the core files (fasta, name, group) and add new experiments (fasta, name, group) using merge.files.

Things to think about: the subsampled size of the core (it has to be lower than any new experiment added), and the variation that methods may cause (combining experiments with different DNA extraction methods, different PCR primers…). We did something similar, adding data from the literature.
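To illustrate the subsampling concern, here is a minimal Python sketch that counts reads per sample from group-file text (sequence name, a tab, then the sample name) and finds the largest depth every sample can still be subsampled to. The sample names and the helper `reads_per_sample` are hypothetical, and the sketch assumes each sample appears in only one experiment.

```python
from collections import Counter

# Toy group-file contents: sequence name, a tab, then the sample (group) name.
CORE_GROUPS = (
    "seq1\tchick_A1\nseq2\tchick_A1\nseq3\tchick_A1\n"
    "seq4\tswine_B1\nseq5\tswine_B1\nseq6\tswine_B1\nseq7\tswine_B1\n"
)
NEW_GROUPS = "seq8\tslaughter_C1\nseq9\tslaughter_C1\n"


def reads_per_sample(group_text):
    """Count sequences per sample from two-column group-file text."""
    counts = Counter()
    for line in group_text.splitlines():
        if line.strip():
            _seq_name, sample = line.split("\t")
            counts[sample] += 1
    return counts


core_counts = reads_per_sample(CORE_GROUPS)
new_counts = reads_per_sample(NEW_GROUPS)

# Merging the group files amounts to adding the per-sample totals.
merged_counts = core_counts + new_counts

# A common subsampling depth cannot exceed the smallest sample that is kept.
depth = min(merged_counts.values())
print(dict(merged_counts), "-> depth", depth)
```

In this toy case the new experiment has the shallowest sample, so it sets the depth for the whole merged set; a core that was already subsampled deeper than that would have to be subsampled again.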

Sigmund

pschloss · July 23, 2021, 5:59pm

Another issue with pooling different datasets prior to clustering is that if you run filter.seqs separately, then you will have different alignments for each dataset. That will make it difficult to calculate distances between sequences from different datasets.
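A toy Python illustration of that problem follows. It uses a simplified stand-in for column filtering (drop every column that is a gap in all sequences), which is only an assumption about the general idea and not mothur's actual filter.seqs implementation; the sequences and function names are made up.

```python
def kept_columns(aligned):
    """Indices of columns where at least one sequence has a base (not '-' or '.')."""
    length = len(next(iter(aligned.values())))
    return [
        i for i in range(length)
        if any(seq[i] not in "-." for seq in aligned.values())
    ]


def apply_filter(aligned, columns):
    """Keep only the given columns in every aligned sequence."""
    return {name: "".join(seq[i] for i in columns) for name, seq in aligned.items()}


# Two toy datasets aligned against the same 8-column reference coordinates.
dataset_a = {"a1": "AC--GT-A", "a2": "AC--GTTA"}
dataset_b = {"b1": "ACG-GT--", "b2": "AC--GT--"}

# Filtering each dataset on its own keeps a different set of columns,
# so position i in one filtered file is not position i in the other.
cols_a = kept_columns(dataset_a)
cols_b = kept_columns(dataset_b)
print("A keeps", cols_a, "B keeps", cols_b)

# Filtering after pooling gives every sequence the same coordinates.
pooled = {**dataset_a, **dataset_b}
cols_pooled = kept_columns(pooled)
filtered_pooled = apply_filter(pooled, cols_pooled)
print("pooled keeps", cols_pooled)
```

Once the columns no longer line up between datasets, comparing the filtered sequences position by position stops being meaningful, which is why the filtering has to happen after the datasets are pooled.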

Be on the lookout for a preprint from us describing cluster.fit, which will make more robust OTU assignments for an open-reference clustering approach. Hopefully, this will be available in the next month.

Pat

Alexandre_Thibodeau · July 23, 2021, 6:53pm

Thanks sje062 and Dr Schloss.

My idea is simply to reduce computing time. But I do get your point. So I could process each experiment separately up to the alignment against Silva, then pool the different experiments, run unique.seqs, run align.seqs, and so on.

That will be heavy for the computer. Fortunately, most of my database will be swine and chicken, so at some point the number of unique sequences found should stabilize and therefore limit the size of the alignment and of the subsequent distance matrix.
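One way to check whether the number of unique sequences really levels off is to track how many new uniques each added experiment contributes. The Python sketch below does that on contrived toy reads; the experiment names and the helper `add_experiment` are hypothetical, and real data with residual sequencing errors may well keep producing new uniques for longer.

```python
from collections import Counter


def add_experiment(core, sequences):
    """Add one experiment's reads to the core tally; return how many uniques were new."""
    before = len(core)
    core.update(sequences)
    return len(core) - before


core = Counter()  # unique sequence -> total read count across experiments
experiments = {
    "exp01_chicken": ["ACGT", "ACGT", "ACGA", "TTGA"],
    "exp02_swine": ["ACGT", "TTGA", "GGCA"],
    "exp03_chicken": ["ACGT", "ACGA", "TTGA", "TTGA"],
}

new_uniques = {name: add_experiment(core, reads) for name, reads in experiments.items()}
print(new_uniques)
print(len(core), "unique sequences from", sum(core.values()), "reads")
```

If the per-experiment count of new uniques trends toward zero, the alignment and distance matrix grow only slowly with each addition; if it does not, the pooled approach will keep getting heavier.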

Can’t wait to try the new function. I hope it will solve some issues that I had with my disappearing Salmonella.

And by the way, thanks for the Riffomonas videos. I am currently watching the ones about mikropml, which gave me the idea and, more importantly, the will to spend the time trying to put together the aforementioned database.

Cheers!

