cluster.split

I’m trying to run cluster.split, but my dist file is 112 GB and the command ran for over 96 hours before I cancelled it (I didn’t think it should take that long). I started preprocessing with over 350 000 sequences and I am down to just over 100 000. Is there something I’m missing? I’ve looked at other posts and apparently a 30 GB dist file is massive… (Sorry, very new to this).

Yeah, there’s no reason you should be working with 350,000 sequences as input. Have you looked at the Costello Analysis Example? It’s important to get the error correction right because the sequencer basically acts like a random sequence generator. That makes downstream processing very difficult (as you have found) and artificially inflates the biodiversity of your samples.
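
For a rough sense of why the number of sequences matters so much here: the number of possible pairwise distances among N sequences is N(N-1)/2, so it grows with the square of N. The snippet below is only a back-of-the-envelope sketch in Python, not anything mothur itself runs. The function name `pairwise_count` is made up for illustration, and the counts are an upper bound, since a distance cutoff can mean fewer distances are actually written to the file.

```python
def pairwise_count(n_seqs: int) -> int:
    """Number of unordered pairs among n_seqs sequences: n * (n - 1) / 2."""
    return n_seqs * (n_seqs - 1) // 2


# Sequence counts taken from the question above, plus a smaller one for comparison.
for n in (10_000, 100_000, 350_000):
    print(f"{n:>7,} sequences -> up to {pairwise_count(n):,} pairwise distances")
```

Going from 100,000 sequences down to 10,000 cuts the number of possible distances by roughly a factor of 100, which is why reducing the sequence count through error correction pays off so quickly.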

Pat