Cluster.split taxlevel

dporco · February 17, 2025, 10:10am

Hello
I don’t understand how the taxonomy level (taxlevel option) is used to split the distance file in the cluster.split command. Could somebody please explain it or point me to where I could find an explanation?
I had a quick look at what happens when I change taxlevel from 6 to 7, using a small dataset and a well-curated species-level database. With taxlevel=6 I get a few more hits (reads identified) that I don’t get with taxlevel=7, and far fewer the other way around (i.e. hits that I get with taxlevel=7 but not with taxlevel=6). I would have expected taxlevel=7 to perform better here.
Thanks for your help!
David

pschloss · February 17, 2025, 3:26pm

Hi David,

It takes your taxonomy strings and sends sequences to different groups based on their classification at the chosen level, where the levels are the semicolon-separated fields of the string. Every level needs to be represented for all of the sequences for it to work correctly. Do you have a minimal reproducible example that you could post that shows what you’re seeing? I could take a look at it and let you know what I see…
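
As a rough illustration of that grouping step, here is a minimal Python sketch. It is not mothur's implementation: `split_by_taxlevel` is a hypothetical helper, and it assumes taxonomy strings without confidence scores and with every level filled in. The two example strings are the reference entries quoted later in this thread.

```python
from collections import defaultdict


def split_by_taxlevel(taxonomy, taxlevel):
    """Group sequence names by the first `taxlevel` levels of their taxonomy.

    `taxonomy` maps a sequence name to a semicolon-separated taxonomy string.
    Hypothetical helper for illustration only, not part of mothur.
    """
    groups = defaultdict(list)
    for name, tax in taxonomy.items():
        # Drop the empty field left by the trailing semicolon.
        levels = [field for field in tax.split(";") if field]
        key = ";".join(levels[:taxlevel])
        groups[key].append(name)
    return dict(groups)


prefix = ("Eukaryota;Annelida_6340;Clitellata_42113;"
          "Haplotaxida_6382;Lumbricidae_6392;Aporrectodea;")
taxonomy = {
    "MNHNL145207": prefix + "A_caliginosa_L2;",
    "MNHNL146587": prefix + "A_caliginosa_L3;",
}

by_level_6 = split_by_taxlevel(taxonomy, 6)  # grouped down to genus
by_level_7 = split_by_taxlevel(taxonomy, 7)  # grouped down to species/lineage

for level, groups in ((6, by_level_6), (7, by_level_7)):
    print(f"taxlevel={level}: {len(groups)} group(s)")
```

In this sketch the two sequences share one group at taxlevel=6 and fall into separate groups at taxlevel=7. Since clustering happens inside each group, sequences placed in different groups cannot end up in the same OTU, however small the distance between them.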

Pat

dporco · February 18, 2025, 9:56am

Hi Pat
Thanks for your answer. So it first groups seqs according to the taxlevel option, and then builds clusters at the chosen cutoff within each of these groups?

Here is the shape of my ref library; all levels are populated. In fact I start from a wider dataset and target a small taxonomic subset with a custom library.
MNHNL145207 Eukaryota;Annelida_6340;Clitellata_42113;Haplotaxida_6382;Lumbricidae_6392;Aporrectodea;A_caliginosa_L2;
MNHNL146587 Eukaryota;Annelida_6340;Clitellata_42113;Haplotaxida_6382;Lumbricidae_6392;Aporrectodea;A_caliginosa_L3;
MNHNL146567

Here are the command lines I use to get there (maybe the issue originates from here)

classify.seqs(fasta=mergedfastaunique, count=mergedfastacount_table, cutoff=90, reference=lumbref.fas, taxonomy=lumb5.txt)

cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=6, cutoff=0.03, runsensspec=f, processors=32)

make.shared(list=current, count=current, label=0.03)

classify.otu(list=current, count=current, taxonomy=current, label=0.03)

merge.otus(constaxonomy=current, list=current)

Here is a table of the few discrepancies I get between the two taxlevel settings (the bottom rows differ only in count):

taxlevel=6					taxlevel=7
sample_id	otu	count	genus	taxon	sample_id	otu	count	genus	taxon
					HS_D_M_9_6	Otu050790	1	Lumbricus	L_rubellus_L2
HS_Esc_M_10_4	Otu002940	19	Lumbricus	L_rubellus_LP1
HS_H_M_10_4	Otu013040	1	Aporrectodea	A_caliginosa_L2
					HS_H_M_6_14	Otu050790	1	Lumbricus	L_rubellus_L2
HS_H_M_7_26	Otu002940	2	Lumbricus	L_rubellus_LP1
HS_W_M_7_12	Otu053147	1	Octolasion	Octolasion_sp._BIOUG32056_A02_2474609
M_BE_M_5_3	Otu072321	2	Aporrectodea	A_icterica
M_BI_M_10_4	Otu002940	86	Lumbricus	L_rubellus_LP1
M_BI_M_6_28	Otu002940	1	Lumbricus	L_rubellus_LP1
M_BI_M_8_24	Otu013040	1	Aporrectodea	A_caliginosa_L2
M_E_M_7_26	Otu002940	1	Lumbricus	L_rubellus_LP1
M_H_M_6_14	Otu013040	3	Aporrectodea	A_caliginosa_L2
M_H_M_6_28	Otu002940	1	Lumbricus	L_rubellus_LP1
M_H_M_9_20	Otu020357	1	Aporrectodea	Aporrectodea_unclassified
M_HER_M_5_3	Otu002940	1	Lumbricus	L_rubellus_LP1
O_A_M_10_4	Otu020357	1	Aporrectodea	Aporrectodea_unclassified
O_A_M_6_28	Otu053147	1	Octolasion	Octolasion_sp._BIOUG32056_A02_2474609
O_G_M_6_14	Otu002940	1	Lumbricus	L_rubellus_LP1
O_HAM_M_5_17	Otu002940	1	Lumbricus	L_rubellus_LP1
					O_HAM_M_9_6	Otu050790	1	Lumbricus	L_rubellus_L2
					O_HE_M_7_12	Otu014656	1	Aporrectodea	A_caliginosa_L2
O_HE_M_9_20	Otu020357	1	Aporrectodea	Aporrectodea_unclassified
O_HO_M_5_3	Otu020357	2	Aporrectodea	Aporrectodea_unclassified
O_HO_M_7_26	Otu021795	1	Dendrobaena	Dendrobaena_unclassified
O_HU_M_6_14	Otu053316	4	Lumbricus	L_rubellus_L2
O_V_M_8_23	Otu021795	1	Dendrobaena	Dendrobaena_unclassified
O_V_M_8_9	Otu020357	1	Aporrectodea	Aporrectodea_unclassified
O_V_M_8_9	Otu021795	1	Dendrobaena	Dendrobaena_unclassified
M_H_M_7_26	Otu002940	30	Lumbricus	L_rubellus_LP1	M_H_M_7_26	Otu003061	28	Lumbricus	L_rubellus_LP1
M_HER_M_9_20	Otu020357	10	Aporrectodea	Aporrectodea_unclassified	M_HER_M_9_20	Otu023121	8	Aporrectodea	Aporrectodea_unclassified
O_G_M_8_9	Otu013040	25	Aporrectodea	A_caliginosa_L2	O_G_M_8_9	Otu014656	21	Aporrectodea	A_caliginosa_L2
O_HAC_M_5_3	Otu013040	9	Aporrectodea	A_caliginosa_L2	O_HAC_M_5_3	Otu014656	4	Aporrectodea	A_caliginosa_L2
O_HAU_M_8_9	Otu002940	24	Lumbricus	L_rubellus_LP1	O_HAU_M_8_9	Otu003061	21	Lumbricus	L_rubellus_LP1
O_HAU_M_9_20	Otu013040	8	Aporrectodea	A_caliginosa_L2	O_HAU_M_9_20	Otu014656	7	Aporrectodea	A_caliginosa_L2
O_HE_M_6_14	Otu002940	40	Lumbricus	L_rubellus_LP1	O_HE_M_6_14	Otu003061	38	Lumbricus	L_rubellus_LP1
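
For anyone wanting to build a comparison like the table above, here is a minimal Python sketch. `compare_hits` is a hypothetical helper, and the few entries are copied from rows of the table. It keys each hit on (sample, taxon) rather than on the OTU label, because the labels differ between the two runs (for example Otu002940 versus Otu003061 for L_rubellus_LP1).

```python
def compare_hits(hits_a, hits_b):
    """Compare two runs keyed on (sample_id, taxon) -> read count.

    Returns the keys found only in run A, only in run B, and the keys
    present in both runs but with different counts.
    """
    keys_a, keys_b = set(hits_a), set(hits_b)
    only_a = sorted(keys_a - keys_b)
    only_b = sorted(keys_b - keys_a)
    count_differs = sorted(k for k in keys_a & keys_b if hits_a[k] != hits_b[k])
    return only_a, only_b, count_differs


# A few rows taken from the table, one dict per taxlevel setting.
taxlevel6 = {
    ("HS_Esc_M_10_4", "L_rubellus_LP1"): 19,
    ("M_H_M_7_26", "L_rubellus_LP1"): 30,
}
taxlevel7 = {
    ("HS_D_M_9_6", "L_rubellus_L2"): 1,
    ("M_H_M_7_26", "L_rubellus_LP1"): 28,
}

only_6, only_7, count_differs = compare_hits(taxlevel6, taxlevel7)
print("only in taxlevel=6:", only_6)
print("only in taxlevel=7:", only_7)
print("same hit, different count:", count_differs)
```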

David

pschloss · February 18, 2025, 8:58pm

It looks like these are consensus taxonomies for each OTU. cluster.split uses the output from classify.seqs, not the consensus taxonomies. I wonder if this is part of the problem.

Pat

dporco · February 20, 2025, 10:43am

Hi Pat

I ran analyses with taxlevel 6 and 7, comparing results when using the consensus taxonomy outputs from classify.otu and merge.otus. For both taxlevels, the output from classify.otu performs far better, with many more hits (i.e. species occurrences in sites). So there was indeed a consensus taxonomy issue when using merge.otus. Why do I lose so many hits when merging OTUs with the same taxonomy? Did I make a mistake when I used it?

Thanks for your help
David

pschloss · February 20, 2025, 1:59pm

To be honest, you might be the first person to ever use merge.otus, and I’m not really sure what it does.

Please use cluster.split as described in the MiSeq SOP. That is how we intend it to be used.

Pat

dporco · February 21, 2025, 8:34am

OK, it is supposed to “combine OTUs based on taxonomic assignment.” I see Sarah Westcott implemented it; I’ll try to contact her. I probably used it incorrectly.
My goal, for big datasets, was to reduce the size of the file that stitches together the shared and taxonomy files for downstream analyses. That made it possible to run them on a regular machine instead of a server with lots of RAM.
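
To make that goal concrete, here is a minimal Python sketch of collapsing a shared table by consensus taxonomy. It is only a stand-in for the idea, not a description of what merge.otus actually does: `collapse_by_taxon` is a hypothetical helper, and the sample names, OTU labels, and counts are made up for illustration.

```python
from collections import defaultdict


def collapse_by_taxon(shared, otu_taxon):
    """Sum OTU counts per sample for OTUs sharing a consensus taxon.

    `shared` maps sample -> {otu: count}; `otu_taxon` maps otu -> taxon.
    Hypothetical helper, not mothur's merge.otus.
    """
    collapsed = {}
    for sample, otu_counts in shared.items():
        totals = defaultdict(int)
        for otu, count in otu_counts.items():
            totals[otu_taxon[otu]] += count
        collapsed[sample] = dict(totals)
    return collapsed


# Made-up toy data: OtuA and OtuB share the same consensus taxon.
otu_taxon = {
    "OtuA": "L_rubellus_LP1",
    "OtuB": "L_rubellus_LP1",
    "OtuC": "A_icterica",
}
shared = {
    "S1": {"OtuA": 3, "OtuB": 2, "OtuC": 2},
    "S2": {"OtuB": 4},
}

collapsed = collapse_by_taxon(shared, otu_taxon)
print(collapsed)
```

A useful sanity check on any such merge is that the total read count per sample is unchanged; if hits disappear after merging, the counts are being dropped somewhere rather than summed.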

One last question on the taxlevel thing (doing it according to the SOP, no merging): comparing the consensus taxonomy from taxlevel 6 and 7, level 7 mostly performs better, but in a few cases I have matches with level 6 that I don’t have with level 7. Any idea why?

Thanks
David

pschloss · February 21, 2025, 2:19pm

What do you mean by performs “mostly better”? If you still have questions about the algorithm, I’d encourage you to check out the papers describing the method and its performance:

https://journals.asm.org/doi/10.1128/mspheredirect.00073-17
https://journals.asm.org/doi/10.1128/aem.02810-10

If you want to combine OTUs by taxonomy, then you’d be better off using the phylotype command to create phylotype-based OTUs. There wouldn’t be any need to go through the de novo OTU generation process.

Pat

dporco · February 24, 2025, 10:23am

I meant that I generally obtain more OTU matches among sites with taxlevel=7 than with 6. But in a few cases I obtain matches with 6 that I don’t get with 7, which I find puzzling.
Thanks for the papers and also the phylotype advice, I’ll give it a go.
David

