I want to analyze data from a 454 Titanium run containing 16S V1-V3 region amplicons with an average length of ~600 bp. Because the amplicon length exceeds the maximum read length, the data set contains sequences of varying read lengths.
As a result, when I use the unique.seqs command I get only very few identical sequences (many sequences are nearly identical but differ slightly in read length). Consequently, I end up with a unique.fasta file that still contains 200,000-300,000 sequences. Clustering then becomes a problem: mothur runs for days without finishing, and the distance matrix has already reached 30 GB.
My question is whether it is possible to collapse a sequence into a longer sequence when the two are identical over the shorter sequence's full length. In other words, I would like to treat sequences that are identical but differ slightly in length as duplicates, and keep only the longest such sequence in my unique.fasta file.
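To make the idea concrete, here is a minimal Python sketch (not mothur code, and the function name and greedy approach are just for illustration) of the prefix-based collapsing I have in mind:

```python
def collapse_prefixes(seqs):
    """Collapse sequences that are identical over their full length to a longer one.

    A shorter sequence is merged into a longer sequence when it is an exact
    prefix of that longer sequence; only the longest representative is kept.
    Greedy O(n*m) scan -- purely illustrative, not efficient for real data.
    """
    kept = []    # longest representatives found so far
    counts = {}  # representative -> number of reads it absorbed
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in kept:
            if rep.startswith(seq):  # identical over seq's full length
                counts[rep] += 1
                break
        else:
            kept.append(seq)
            counts[seq] = 1
    return counts

reads = ["ACGTACGT", "ACGTAC", "ACGT", "TTGCA", "TTG"]
print(collapse_prefixes(reads))  # -> {'ACGTACGT': 3, 'TTGCA': 2}
```

Here the three reads that differ only in length all collapse onto the longest one, which is exactly the behavior I would like unique.seqs (or some other command) to offer.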
Is this currently possible with mothur, or could it be made possible? Or do you have any other suggestions for overcoming this problem?