Cluster.split taxlevel

Hello
I don’t understand how the taxonomy level (the taxlevel option) is used to split the distance file in the cluster.split command. Could somebody please explain it or point me to where I can find an explanation?
I had a quick look at what happens when I change taxlevel from 6 to 7 on a small dataset, using a well-curated species-level database. With taxlevel=6 I get a few more hits (reads identified) that I don’t get with taxlevel=7, and far fewer the other way around (i.e. hits that I get with taxlevel=7 but not with taxlevel=6). I would have expected taxlevel=7 to perform better here.
Thanks for your help!
David

Hi David,

It takes your taxonomy strings and sends sequences to different groups based on the taxon at the chosen level, where the levels are separated by semicolons. Every level needs to be represented for all of the sequences for it to perform correctly. Do you have a minimal reproducible example that you could post that shows what you’re seeing? I could take a look at it and let you know what I see…
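
For illustration, the grouping step described above can be sketched in Python. This is a minimal sketch and not mothur's implementation: the function name `split_by_taxlevel`, grouping on the full lineage prefix, the stripping of confidence scores such as `(100)`, and raising an error for short lineages are all assumptions made for the example.

```python
import re
from collections import defaultdict


def split_by_taxlevel(taxonomy, taxlevel):
    """Group sequence names by the first `taxlevel` levels of their lineage.

    Hypothetical helper: it only mimics the idea of splitting on a
    taxonomic level; mothur's own code may differ in the details.
    """
    bins = defaultdict(list)
    for name, lineage in taxonomy.items():
        # Drop the trailing semicolon and any confidence score like "(100)".
        levels = [re.sub(r"\(\d+\)$", "", level)
                  for level in lineage.strip(";").split(";")]
        if len(levels) < taxlevel:
            # Assumption: treat a lineage with missing levels as an error.
            raise ValueError(f"{name} has only {len(levels)} levels")
        bins[";".join(levels[:taxlevel])].append(name)
    return dict(bins)


taxonomy = {
    "MNHNL145207": "Eukaryota;Annelida_6340;Clitellata_42113;Haplotaxida_6382;"
                   "Lumbricidae_6392;Aporrectodea;A_caliginosa_L2;",
    "MNHNL146587": "Eukaryota;Annelida_6340;Clitellata_42113;Haplotaxida_6382;"
                   "Lumbricidae_6392;Aporrectodea;A_caliginosa_L3;",
}

genus_bins = split_by_taxlevel(taxonomy, 6)    # both sequences share a group
species_bins = split_by_taxlevel(taxonomy, 7)  # each sequence gets its own group
print(len(genus_bins), len(species_bins))
```

Under this reading, sequences that land in different groups are never compared with each other, so at taxlevel=7 reads assigned to A_caliginosa_L2 and A_caliginosa_L3 could not end up in the same OTU, whereas at taxlevel=6 they could.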

Pat

Hi Pat
Thanks for your answer. So it first groups the sequences according to the taxlevel option, and then clusters them at the chosen cutoff within each of these groups?

Here is the format of my reference library; all levels are populated. In fact, I start from a wider dataset and target a small taxonomic subset with a custom library.
MNHNL145207 Eukaryota;Annelida_6340;Clitellata_42113;Haplotaxida_6382;Lumbricidae_6392;Aporrectodea;A_caliginosa_L2;
MNHNL146587 Eukaryota;Annelida_6340;Clitellata_42113;Haplotaxida_6382;Lumbricidae_6392;Aporrectodea;A_caliginosa_L3;
MNHNL146567

Here are the command lines I use to get there (maybe the issue originates from here)

classify.seqs(fasta=mergedfastaunique, count=mergedfastacount_table, cutoff=90, reference=lumbref.fas, taxonomy=lumb5.txt)

cluster.split(fasta=current, count=current, taxonomy=current, taxlevel=6, cutoff=0.03, runsensspec=f, processors=32)

make.shared(list=current, count=current, label=0.03)

classify.otu(list=current, count=current, taxonomy=current, label=0.03)

merge.otus(constaxonomy=current, list=current)

Here is a table of the few discrepancies I get between the two taxlevel settings (the bottom rows differ only in count):

taxlevel=6 taxlevel=7
sample_id otu count genus taxon sample_id otu count genus taxon
HS_D_M_9_6 Otu050790 1 Lumbricus L_rubellus_L2
HS_Esc_M_10_4 Otu002940 19 Lumbricus L_rubellus_LP1
HS_H_M_10_4 Otu013040 1 Aporrectodea A_caliginosa_L2
HS_H_M_6_14 Otu050790 1 Lumbricus L_rubellus_L2
HS_H_M_7_26 Otu002940 2 Lumbricus L_rubellus_LP1
HS_W_M_7_12 Otu053147 1 Octolasion Octolasion_sp._BIOUG32056_A02_2474609
M_BE_M_5_3 Otu072321 2 Aporrectodea A_icterica
M_BI_M_10_4 Otu002940 86 Lumbricus L_rubellus_LP1
M_BI_M_6_28 Otu002940 1 Lumbricus L_rubellus_LP1
M_BI_M_8_24 Otu013040 1 Aporrectodea A_caliginosa_L2
M_E_M_7_26 Otu002940 1 Lumbricus L_rubellus_LP1
M_H_M_6_14 Otu013040 3 Aporrectodea A_caliginosa_L2
M_H_M_6_28 Otu002940 1 Lumbricus L_rubellus_LP1
M_H_M_9_20 Otu020357 1 Aporrectodea Aporrectodea_unclassified
M_HER_M_5_3 Otu002940 1 Lumbricus L_rubellus_LP1
O_A_M_10_4 Otu020357 1 Aporrectodea Aporrectodea_unclassified
O_A_M_6_28 Otu053147 1 Octolasion Octolasion_sp._BIOUG32056_A02_2474609
O_G_M_6_14 Otu002940 1 Lumbricus L_rubellus_LP1
O_HAM_M_5_17 Otu002940 1 Lumbricus L_rubellus_LP1
O_HAM_M_9_6 Otu050790 1 Lumbricus L_rubellus_L2
O_HE_M_7_12 Otu014656 1 Aporrectodea A_caliginosa_L2
O_HE_M_9_20 Otu020357 1 Aporrectodea Aporrectodea_unclassified
O_HO_M_5_3 Otu020357 2 Aporrectodea Aporrectodea_unclassified
O_HO_M_7_26 Otu021795 1 Dendrobaena Dendrobaena_unclassified
O_HU_M_6_14 Otu053316 4 Lumbricus L_rubellus_L2
O_V_M_8_23 Otu021795 1 Dendrobaena Dendrobaena_unclassified
O_V_M_8_9 Otu020357 1 Aporrectodea Aporrectodea_unclassified
O_V_M_8_9 Otu021795 1 Dendrobaena Dendrobaena_unclassified
M_H_M_7_26 Otu002940 30 Lumbricus L_rubellus_LP1 M_H_M_7_26 Otu003061 28 Lumbricus L_rubellus_LP1
M_HER_M_9_20 Otu020357 10 Aporrectodea Aporrectodea_unclassified M_HER_M_9_20 Otu023121 8 Aporrectodea Aporrectodea_unclassified
O_G_M_8_9 Otu013040 25 Aporrectodea A_caliginosa_L2 O_G_M_8_9 Otu014656 21 Aporrectodea A_caliginosa_L2
O_HAC_M_5_3 Otu013040 9 Aporrectodea A_caliginosa_L2 O_HAC_M_5_3 Otu014656 4 Aporrectodea A_caliginosa_L2
O_HAU_M_8_9 Otu002940 24 Lumbricus L_rubellus_LP1 O_HAU_M_8_9 Otu003061 21 Lumbricus L_rubellus_LP1
O_HAU_M_9_20 Otu013040 8 Aporrectodea A_caliginosa_L2 O_HAU_M_9_20 Otu014656 7 Aporrectodea A_caliginosa_L2
O_HE_M_6_14 Otu002940 40 Lumbricus L_rubellus_LP1 O_HE_M_6_14 Otu003061 38 Lumbricus L_rubellus_LP1

David

It looks like these are consensus taxonomies for each OTU. cluster.split uses the output from classify.seqs not the consensus taxonomies. I wonder if this is part of the problem.
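
To make the distinction concrete, here is a rough Python sketch of how a consensus taxonomy for one OTU could be derived from the per-sequence taxonomies. It is an illustration only, not the classify.otu code: the function name `consensus_taxonomy`, the simple per-level majority vote, the 51% threshold, and the `_unclassified` suffix rule are assumptions chosen to mirror the labels seen in the table.

```python
from collections import Counter


def consensus_taxonomy(lineages, cutoff=51):
    """Return a per-level majority label for the sequences in one OTU.

    Hypothetical helper: each level is voted on independently, and the
    lineage is cut at the first level where no taxon reaches `cutoff` %.
    """
    split = [lineage.strip(";").split(";") for lineage in lineages]
    consensus = []
    for level in zip(*split):
        taxon, count = Counter(level).most_common(1)[0]
        if 100 * count / len(level) < cutoff:
            # No taxon reaches the cutoff: mark the level as unresolved.
            parent = consensus[-1] if consensus else "unknown"
            consensus.append(parent + "_unclassified")
            break
        consensus.append(taxon)
    return ";".join(consensus) + ";"


# Two sequences in one OTU that disagree at the species level.
otu_members = [
    "Lumbricidae_6392;Aporrectodea;A_caliginosa_L2;",
    "Lumbricidae_6392;Aporrectodea;A_caliginosa_L3;",
]
print(consensus_taxonomy(otu_members))
```

In this toy case each sequence carries a species-level label, but the OTU as a whole is only resolved to genus. Per-sequence labels and consensus labels are therefore different things, and only the former are used for splitting.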

Pat

Hi Pat

I ran analyses with taxlevel 6 and 7, comparing the results from the consensus taxonomy outputs of classify.otu and merge.otus. For both taxlevels, the output from classify.otu performs much better, with many more hits (i.e. species occurrences in sites). So there was indeed a consensus taxonomy issue when using merge.otus. Why do I lose so many hits when merging OTUs with the same taxonomy? Did I make a mistake when I used it?

Thanks for your help
David

To be honest, you might be the first person to ever use merge.otus and I’m not really sure what it does 🙃

Please use cluster.split as described in the MiSeq SOP. That is how we intend it to be used.

Pat

OK 🙂, it is supposed to “combine OTUs based on taxonomic assignment.” I see Sarah Westcott implemented it; I’ll try to contact her. I probably used it incorrectly.
My goal was, for big datasets, to reduce the size of the file that joins the shared and taxonomy files for downstream analyses. It allowed me to run them on a regular machine instead of a server with lots of RAM.
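
As a point of comparison, the kind of reduction described here can be sketched in a few lines of Python. This is not what merge.otus does internally (its behaviour is not documented in this thread); the function name `merge_counts_by_taxon`, the dictionary layout, and the toy sample and OTU names are all hypothetical.

```python
from collections import defaultdict


def merge_counts_by_taxon(shared, otu_taxon):
    """Sum per-sample OTU counts over OTUs that share a taxonomy label.

    shared:    {sample: {otu: count}}  (toy stand-in for a shared file)
    otu_taxon: {otu: taxon label}      (toy stand-in for a cons.taxonomy file)
    """
    merged = defaultdict(lambda: defaultdict(int))
    for sample, otu_counts in shared.items():
        for otu, count in otu_counts.items():
            merged[sample][otu_taxon[otu]] += count
    return {sample: dict(taxa) for sample, taxa in merged.items()}


# Toy data: OtuA and OtuB carry the same label and collapse into one column.
shared = {"S1": {"OtuA": 3, "OtuB": 2, "OtuC": 1}}
otu_taxon = {"OtuA": "L_rubellus_L2", "OtuB": "L_rubellus_L2", "OtuC": "A_icterica"}
print(merge_counts_by_taxon(shared, otu_taxon))
```

A merge of this kind keeps the total count per sample unchanged, so comparing totals before and after merging is one way to check whether hits are being dropped somewhere along the way.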

One last question on taxlevel (following the SOP, no merging): comparing the consensus taxonomies from taxlevel 6 and 7, level 7 mostly performs better, but in a few cases I have matches at level 6 that I don’t have at level 7. Any idea why?

Thanks
David

What do you mean by performs “mostly better”? If you still have questions about the algorithm, I’d encourage you to check out the papers describing the method and its performance:

https://journals.asm.org/doi/10.1128/mspheredirect.00073-17
https://journals.asm.org/doi/10.1128/aem.02810-10

If you want to combine OTUs by taxonomy, then you’d be better off using the phylotype command to create phylotype-based OTUs. There wouldn’t be any need to go through the de novo OTU generation process.

Pat

I meant that I generally obtain more OTU matches among sites with taxlevel=7 than with 6. But in a few cases I obtain matches with 6 that I don’t get with 7, which I find puzzling.
Thanks for the papers and also the phylotype advice, I’ll give it a go.
David
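
One way to make this comparison explicit is to compare the two runs on (sample, taxon) pairs rather than on OTU labels, since OTU numbers are assigned independently in each run (Otu002940 and Otu003061 in the table carry the same taxon). The Python sketch below is only an illustration: the function name `compare_runs` and the tuple layout are hypothetical, and assigning the M_BI_M_10_4 row to the taxlevel=6 run is an assumption based on its OTU label.

```python
def compare_runs(run_a, run_b):
    """Compare two runs on (sample, taxon) pairs; OTU labels are ignored.

    Each run is a list of (sample_id, otu, count, taxon) tuples.
    """
    pairs_a = {(sample, taxon) for sample, _otu, _count, taxon in run_a}
    pairs_b = {(sample, taxon) for sample, _otu, _count, taxon in run_b}
    return {
        "only_a": sorted(pairs_a - pairs_b),
        "only_b": sorted(pairs_b - pairs_a),
        "shared": sorted(pairs_a & pairs_b),
    }


# Rows taken from the table; the second row's run assignment is assumed.
taxlevel6 = [
    ("M_H_M_7_26", "Otu002940", 30, "L_rubellus_LP1"),
    ("M_BI_M_10_4", "Otu002940", 86, "L_rubellus_LP1"),
]
taxlevel7 = [
    ("M_H_M_7_26", "Otu003061", 28, "L_rubellus_LP1"),
]
print(compare_runs(taxlevel6, taxlevel7))
```

Listing the pairs found in only one run makes it easier to go back to the individual reads and check how they were classified and grouped in each case.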
