Cluster.fit SOP

Hello!

As I am trying to merge different runs into a huge database, cluster.fit seems like the only solution for me, given the limited computing power at my disposal. So I first tried this:

" mothur > merge.files(input=reference.fasta-query.fasta, output=combined.fasta)
mothur > merge.count(count=reference.count_table-query.count_table, output=combined.count_table)
mothur > dist.seqs(fasta=combined.fasta, cutoff=0.03)
"

But I got an error message telling me that dist.seqs did not complete because the reads do not all have the same length.
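For reference, the length spread of the merged reads can be checked with summary.seqs before computing distances (file names are the ones from the merge above):

mothur > summary.seqs(fasta=combined.fasta, count=combined.count_table)

If the Start, End, or NBases columns vary across reads, dist.seqs will refuse to run until the reads are aligned and filtered to a common length.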

I then ran unique.seqs on the merged file to try to unify the reads. No new uniques were found, which is quite strange to me.

Finally, I went into another direction which is the following:

merge
align
screen
filter
unique
classify.seqs
dist.seqs

then (to come):
cluster.fit(reflist=megacampy.trim.contigs.unique.good.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.list, column=current, fasta=current, count=current, delta=0, iters=1000, cutoff=0.02, processors=32)

to come: make.shared(list=current, count=current, label=0.02)

to come: classify.otu(list=current, count=current, taxonomy=current, label=0.02, threshold=78)

and onwards with analysis.

This time around the number of uniques decreased after the unique step, and the distance matrix is being calculated. I will run cluster.fit afterwards, once I get the names of the created files right for the input command.

Do you believe that all this is needed before making it to cluster.fit? Would simply running screen.seqs after merging, then unique.seqs, then continuing with the rest work? My fear is also that, since I used unique.seqs (which feels logical to me), this will impair cluster.fit because the data might not match perfectly the reference list I am giving mothur.

Cheers!

Hello!

What I feared actually happened at the make.shared step. So when merging the files, you cannot run unique.seqs, otherwise the list file (which probably contains some reference OTUs that disappeared from the count file after unique.seqs) does not match. On a good note, cluster.fit did work without problems.

mothur > make.shared(list=current, count=current, label=0.02)
Using combinedphyto.good.filter.count_table as input file for the count parameter.
Using combinedphyto.good.filter.unique.optifit_mcc.list as input file for the list parameter.
[ERROR]: M02509_129_000000000-AMDEF_1_1101_10106_14190 is in your groupfile and not your listfile. Please correct.
[ERROR]: M02509_129_000000000-AMDEF_1_1101_10201_17028 is in your groupfile and not your listfile. Please correct.

I just thought it might also be the screen.seqs step that is deleting things.

Is there a way to have mothur make a list file that is on par with the count file?
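For the record, one way to force the count file to match the list file (assuming the extra names really should be dropped) is to pull the names out of the list file with list.seqs and subset the count file with get.seqs; file names below are the ones from the error above:

mothur > list.seqs(list=combinedphyto.good.filter.unique.optifit_mcc.list)
mothur > get.seqs(accnos=current, count=combinedphyto.good.filter.count_table)

Though trimming the count file this way only hides the mismatch rather than fixing its cause.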

So, after taking a step back, I decided to rerun my old data set, which serves as the basis for the cluster.fit reference, using the exact same parameters as in my new experiment. After that, I will merge the files, without modifying the merged file further, and see what happens.

Still, it is a bit annoying if you must always rerun the reference database every time you start using a new version of mothur.

What I want to achieve is to merge several big data sets together at some point, so that every new experiment gets merged into the full database every time I sequence something, thus growing the database each time until infinity and beyond! This is mandatory, since I cannot run more than 600 to 700 samples at the same time because of computer restrictions. I am aiming for a database that, for now, would include around 1500 chicken samples from my lab, plus whatever I can fetch online or during collaborative work.

Cheers, I will keep posting on my adventures with this project.

Ok, after rerunning everything, I am still getting the same error right after merging my two fasta and my two count files.

mothur > dist.seqs(fasta=current, cutoff=0.02, processors=32)
Using combinedphyto.fasta as input file for the fasta parameter.

Using 32 processors.
[ERROR]: your sequences are not the same length, aborting.

Any ideas?

Thanks again,

Alright, still not able to complete this.

What I want to do is the following:

dataset1 (past runs) + dataset 2 (past runs) = dataset 3

dataset 3 + dataset 4 (new run) = dataset 5

and so on as new runs enter the lab, until infinity!

So far, I have tried several approaches, but I am stuck at the make.shared step.

It says that there are sequences in my count file that are not in the list. It feels like the output list from cluster.fit does not contain everything that is in the count file I am providing, which is weird.


mothur > cluster.fit(reflist=megacampy.trim.contigs.unique.good.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.list, column=current, fasta=current, count=current, delta=0, iters=1000, cutoff=0.02, processors=32)

Using combinedphyto.good.filter.unique.dist as input file for the column parameter.
Using combinedphyto.good.filter.count_table as input file for the count parameter.
Using combinedphyto.good.filter.unique.fasta as input file for the fasta parameter.

Output File Names:
combinedphyto.good.filter.unique.optifit_mcc.sensspec
combinedphyto.good.filter.unique.optifit_mcc.steps
combinedphyto.good.filter.unique.optifit_mcc.list

mothur > make.shared(list=current, count=current, label=0.02)
Using combinedphyto.good.filter.count_table as input file for the count parameter.
Using combinedphyto.good.filter.unique.optifit_mcc.list as input file for the list parameter.

[ERROR]: M02509_129_000000000-AMDEF_1_1101_10106_14190 is in your groupfile and not your listfile. Please correct.

Is it the fasta file that does not match the count file? I do not know.

Or maybe I am not using the right command? I will try another command for cluster.fit:

1) Run the second dataset through the pipeline as always, except for the clustering

2) Fit the second dataset to the first one
mothur > cluster.fit(fasta=second_dataset.fasta, column=second_dataset.dist, count=second_dataset.count_table, reffasta=first_dataset.fasta, refcolumn=first_dataset.dist, reflist=first_dataset.list, delta=0, iters=1000, cutoff=0.02)

My question would be: what is the output of this? Does it output “merged” files? Guess I will see in 24 hours or so, once the run finishes.

Keep you posted.

I am happy to help. The filtering step is likely what is causing the length mismatch. Here’s what I recommend:

mothur > merge.files(input=dataset1.fasta-dataset2.fasta-…-datasetn.fasta, output=merged1.fasta) - merge a subset of your datasets that will process in a reasonable amount of time

mothur > merge.count(count=dataset1.count_table-dataset2.count_table-…-datasetn.count_table, output=merged1.count_table) - merge a subset of your dataset that will process in a reasonable amount of time

mothur > unique.seqs(fasta=merged1.fasta, count=merged1.count_table) - merge identical reads and update the count file

mothur > align.seqs(fasta=current, reference=yourReferenceFile) - align reads

mothur > screen.seqs(fasta=current, count=current, … other parameters… )

mothur > filter.seqs(fasta=current, vertical=t, trump=.) - filter reads

mothur > unique.seqs(fasta=current, count=current) - merge identical reads created after filtering

mothur > pre.cluster(fasta=current, count=current, diffs=2) - combine reads with diffs<=2

mothur > chimera.vsearch(fasta=current, count=current, dereplicate=t) - remove chimeras

mothur > classify.seqs(fasta=current, count=current, …other parameters …)

mothur > remove.lineage(fasta=current, count=current, taxonomy=current, …other parameters…) - remove contaminants

mothur > cluster.split(fasta=current, count=current, taxonomy=current, runsensspec=t) - create list file and column distance file

Now to fit another set of datasets to merge1:

mothur > merge.files(input=datasetn+1.fasta-datasetn+2.fasta-…-datasetn+m.fasta, output=merged2.fasta) - merge a subset of your datasets that will process in a reasonable amount of time

mothur > merge.count(count=datasetn+1.count_table-datasetn+2.count_table-…-datasetn+m.count_table, output=merged2.count_table) - merge a subset of your dataset that will process in a reasonable amount of time

mothur > unique.seqs(fasta=merged2.fasta, count=merged2.count_table) - merge identical reads and update the count file

mothur > align.seqs(fasta=current, reference=yourReferenceFile) - align reads

mothur > screen.seqs(fasta=current, count=current, … other parameters… )

mothur > filter.seqs(fasta=current, hard=merged1.filter) - filter reads using the filter from the first dataset, to ensure the same length and the same alignment columns are used

mothur > unique.seqs(fasta=current, count=current) - merge identical reads created after filtering

mothur > pre.cluster(fasta=current, count=current, diffs=2) - combine reads with diffs<=2

mothur > chimera.vsearch(fasta=current, count=current, dereplicate=t) - remove chimeras

mothur > classify.seqs(fasta=current, count=current, …other parameters …)

mothur > remove.lineage(fasta=current, count=current, taxonomy=current, …other parameters…) - remove contaminants

mothur > dist.seqs(fasta=current, cutoff=0.03) - create distance matrix for merge2.fasta

mothur > cluster.fit(fasta=current, count=current, column=current, reflist=listFileFromMerge1, refcount=countFileFromMerge1, refcolumn=columnMatrixfromMerge1) - fits sequences from merge2 into the OTUs from merge1; any reads that cannot be fitted will be clustered into new OTUs.

mothur > merge.files(input=columnMatrixfromMerge1-columnMatrixfromMerge2, output=merge12.column) - use merge12.column as the refcolumn in the next cluster.fit

mothur > merge.count(count=countFileFromMerge1-countFileFromMerge2, output=merge12.count_table) - combine the count files to create the new refcount for use in the next cluster.fit

mothur > rename.file(list=current, new=merge12.list) - rename new reflist file for use in next cluster.fit

Repeat for all remaining sets of datasets, always filtering with merged1.filter.
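In other words, the next round would look something like this (file names are placeholders; the merged refcount name depends on what you chose when combining the count files):

mothur > cluster.fit(fasta=current, count=current, column=current, reflist=merge12.list, refcount=merge12.count_table, refcolumn=merge12.column, cutoff=0.02)

with the new reads having been filtered beforehand with hard= using the first dataset's filter.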

Ohhh, thank you for your input!

I will toy with that!

Based on your input, I will start by redoing cluster.split on the merged files and make my shared file from there. This option is the fastest for me right now. As my database grows bigger, the second part of the pipeline will come in handy!

My only question: how do you create the merged1.filter file?

Thanks a lot!

Mothur generates the merged1.filter file as an output of the filter.seqs command. It can then be passed via the hard parameter to indicate which alignment columns to keep.
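Concretely, with a first-pass command like the one in the pipeline above (file names illustrative):

mothur > filter.seqs(fasta=merged1.align, vertical=t, trump=.)

mothur writes a merged1.filter file next to the filtered fasta. It is a string of 0s and 1s, one character per alignment column, with 1 marking the columns that were kept. Passing it as hard=merged1.filter in a later filter.seqs run applies exactly the same columns to the new alignment.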

Victory! Thank you so much! Now I can start pulling all my runs into the database. Next step: machine learning and bacterial networks!

