cluster.split taxonomy question

Hi Pat and Sarah,

Thanks for your previous help with my PacBio 18S data. cluster.split kept crashing, so I contacted PacBio and obtained their Python script for filtering out poor-quality FASTQ reads. The data now looks better and I have been able to use cluster.split, but I have a very quick (and possibly naive) question about taxonomy levels. I apologise in advance: I looked in the forum and papers, but I am confused about whether the taxonomy level specified in this command changes the OTU-based analyses. I have been able to successfully run c. 120,000 unique sequences with taxlevel=3 and large=T (it only works when large=T). Would taxlevel=4 or 5 give better resolution, or does it make no difference at this stage?

Thanks in advance for clarifying this for me,

Bethan

I have been able to successfully run c. 120,000 unique sequences with taxlevel=3 and large=T (only works when large=T).

This is surprising - with average neighbor, large is a disaster - your memory usage will explode.

Would taxlevel=4 or 5 give a better resolution or does it make no difference at this stage?

The idea behind cluster.split is that the output should be the same as if you had used cluster, but that the command actually runs without blowing up your memory. Going to 4 or 5 should make things faster and use less memory, because the sequences are divided into more, smaller bins that are clustered separately. But as you go to 5 or 6, you run the risk that a genus will be less than 0.03 across. Sequences from neighboring genera that are within 0.03 of each other then land in different bins, and you’ll wind up splitting what cluster would have called a single OTU.
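To make the splitting risk concrete, here is a toy Python sketch. It is not mothur code and not how cluster.split is implemented. The sequence names, the 1-D "positions" used as stand-in distances, and the four-level taxonomy strings are all made up for illustration, and it uses simple single-linkage grouping rather than average neighbor. In this toy, level 4 plays the role of the genus level.

```python
from itertools import combinations

# Toy data (entirely hypothetical): a 1-D position stands in for each
# sequence's place in sequence space, so distance = absolute difference.
SEQS = {
    "s1": (0.000, "Eukaryota;Alveolata;Ciliophora;GenusA"),
    "s2": (0.020, "Eukaryota;Alveolata;Ciliophora;GenusA"),
    "s3": (0.040, "Eukaryota;Alveolata;Ciliophora;GenusB"),
    "s4": (0.500, "Eukaryota;Fungi;Ascomycota;GenusC"),
}
CUTOFF = 0.03


def cluster(names):
    """Group names into OTUs by single linkage at CUTOFF (a simplification)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    # Join any two sequences whose distance is within the cutoff.
    for a, b in combinations(names, 2):
        if abs(SEQS[a][0] - SEQS[b][0]) <= CUTOFF:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return {frozenset(g) for g in groups.values()}


def split_then_cluster(taxlevel):
    """Bin sequences by the first `taxlevel` ranks, then cluster each bin alone."""
    bins = {}
    for name, (_, tax) in SEQS.items():
        key = tuple(tax.split(";")[:taxlevel])
        bins.setdefault(key, []).append(name)

    otus = set()
    for members in bins.values():
        otus |= cluster(members)
    return otus


unsplit = cluster(list(SEQS))
for level in (3, 4):
    result = split_then_cluster(level)
    print(level, len(result), "same as unsplit" if result == unsplit else "differs")
```

Here s2 (GenusA) and s3 (GenusB) are within the cutoff of each other, so clustering everything together chains s1, s2 and s3 into one OTU. Splitting at level 3 keeps them in the same bin and gives the same OTUs. Splitting at level 4 puts s3 in its own bin, so it can never join the other two.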

Hi Pat,

I apologize. Yes, that would be surprising; I wrote that prematurely, while the run was still going and I thought it was working well. It totally exploded. So I ran it at taxlevel=4 with large=T and it worked well. Thanks for your response. I am proceeding with the output from this run and am presuming that the choice of taxlevel does not affect the downstream OTU analyses.

Bethan