Align.seqs output too long compared to read length

I am following the MiSeq SOP to analyse my 16S Illumina data. I have reads of about 430 bp, but when I run the align.seqs command I obtain a very long alignment of 1178 nt. At first I didn't notice this problem, but then, when I got to dist.seqs, the command ran for a very long time and generated a huge distance matrix file of more than 200 GB. What can I modify to optimize the alignment of my sequences? Many thanks for your help.
Best
Aurelie

Hello,

You should read this blog post, which discusses the problem you are having.
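As a first check (a quick sketch, assuming your fasta and count files are the "current" files in your mothur session), summary.seqs will show where your reads start and end in the alignment. For a single amplicon nearly all reads should share the same start and end coordinates; scattered coordinates point to the poor-overlap problem described in the post.

#inspect alignment coordinates and homopolymer lengths
summary.seqs(fasta=current, count=current)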

Cheers
Richard

Thanks, I read that blog page, but it depressed me, because it suggests completely re-sequencing the samples. I don't want to throw away all my data!

How many samples do you have?

I have 10 soil samples, each in triplicate, so 30 samples of very different origins.

With soils you are going to be pretty limited by the databases if you try phylotyping. Even without the seq errors, you're going to get a lot of "unclassified" results, which tell you nothing.

Feel free to shoot me an email. I could go from your extracted DNA to V4 sequences using Kozich primers (min 10k/sample) for $38/sample. mars at uconn.edu

So I am a little unclear at this point. Did dist.seqs finish, albeit with a very large distance file? And were you able to progress with the analysis, or were you unable to continue because of the large distance file?

If it's the latter, the blog post I linked does offer some different ways you may be able to get the data to work. Specifically, try running it through cluster.split with taxlevel=6, or instead use a phylotype approach where you just bin things according to their classifications (sketched below). This latter option may not work so well for you, though, given your sample type: as kmitchell says, many things will simply classify as unknown and you will lose a lot of data.
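In case it helps, here is a minimal sketch of the phylotype route (assuming your taxonomy, list, and count files are the "current" files in your mothur session; label=1 corresponds to the genus level):

#bin sequences by their classification instead of by genetic distance
phylotype(taxonomy=current)

#build the shared file and consensus taxonomy at the genus level
make.shared(list=current, count=current, label=1)
classify.otu(list=current, count=current, taxonomy=current, label=1)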

As frustrating as it is, resequencing at a shorter length (e.g. the V4 region only) is really the best option. I went through this myself for my first run, and really there is only so much you can do if your raw data is poor quality. If you want good results you need decent raw data. With this in mind, I think kmitchell is offering you pricing for resequencing your samples with her, so that you can have good sequence data to play with.

Cheers
Richard

The dist.seqs command ran for a long time, but I managed to get a .dist file. However, when I ran cluster with cutoff=0.03, at the end it said "changed cutoff to 0.00947584". Then, when I ran classify.otu with my unique_list.list file, a new message appeared: "Your file does not include the label 0.03. I will use unique." It seems that each of my unique sequences was treated as its own OTU?

For soils, changing the taxlevel doesn't help much because many uniques will be unclassified beyond taxlevel 3 or 4 (so dropping to 6 doesn't change anything). You could raise your diffs when you precluster; for a ~400bp amplicon, I'd try diffs=6 (1.5%).
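Something like this, assuming your fasta and count files are the "current" files in your session:

#merge sequences within 6 mismatches of a more abundant sequence (~1.5% of a ~400bp amplicon)
pre.cluster(fasta=current, count=current, diffs=6)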

What version of mothur are you using? Up through 1.38, you need to set your cutoff well above your desired OTU cutoff (so I'd try 0.09). Here's what I'd do; you may need to drop the processors in the first cluster.split depending on your machine. I run on a server with 16 processors and 256GB RAM.

#make OTUs for each Order individually; for very large datasets (hundreds of samples) you may need to split at a finer taxlevel, 5 or even 6. If you use 6 you will likely only get 3% OTUs, because the within-group differences aren't always 5%

cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.09, processors=16, cluster=f)

#if your data is very diverse you may end up with .dist files that are too large for the number of processors (if you are using 4 processors, your largest .dist file needs to fit in RAM 4 times; on my system I have to drop the processors in this step if my largest .dist is over 64GB)
cluster.split(file=current, processors=1)
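Once clustering finishes you can pick up the rest of the SOP. A minimal sketch (assuming the 0.09 cutoff above, so the list file should now contain the 0.03 label that classify.otu was missing earlier):

#build the shared file and consensus taxonomy at 3%
make.shared(list=current, count=current, label=0.03)
classify.otu(list=current, count=current, taxonomy=current, label=0.03)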