I am analysing a MiSeq dataset following the SOP and have got as far as generating the shared file, so I have now moved over to the 454 SOP.
I am here:
“First we need to subsample the sequences from each group and then construct a phylip-formatted distance matrix, which we calculate with dist.seqs”. But the command which follows does not seem to involve a subsampling step as mentioned above?
A secondary issue is that if I run dist.seqs on the non-subsampled fasta file, it takes days to run. I have c. 490,000 sequences and they are c. 500 bp long. Is there any way to speed it up?
Many thanks for your help,
William
“First we need to subsample the sequences from each group and then construct a phylip-formatted distance matrix, which we calculate with dist.seqs”. But the command which follows does not seem to involve a subsampling step as mentioned above?
Sorry about that, that was an old typo that has been corrected. You will subsample if/when you run the unifrac commands (unifrac.weighted, unifrac.unweighted) or phylo.diversity.
A secondary issue is that if I run dist.seqs on the non-subsampled fasta file, it takes days to run. I have c. 490,000 sequences and they are c. 500 bp long. Is there any way to speed it up?
Yeah, don’t use MiSeq to generate 500 bp contigs. The problem is that your error rate is much higher than you would get with 454 or with fully overlapping sequencing reads. If you look at Kozich et al. (2013), you’ll see that as you decrease the overlap between the reads, your error rate skyrockets. If you are unable to get OTUs to work, I generally suggest using a phylotype-based approach. Sorry!
Pat
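For a rough sense of why dist.seqs is so slow here: an all-against-all comparison grows with the square of the number of sequences. Below is a minimal Python sketch of that arithmetic, assuming every one of the c. 490,000 sequences is compared with every other. The function name is illustrative only, and the real cost depends on how many unique sequences remain and on any distance cutoff used.

```python
def pairwise_comparisons(n_seqs: int) -> int:
    """Number of unordered sequence pairs among n_seqs sequences."""
    # Each sequence is paired once with every other sequence: n * (n - 1) / 2.
    return n_seqs * (n_seqs - 1) // 2


if __name__ == "__main__":
    # 490,000 is the sequence count quoted in the question (an assumption
    # about what dist.seqs actually sees); 49,000 is a hypothetical tenfold
    # reduction, shown for comparison.
    for n in (49_000, 490_000):
        print(f"{n:>7} sequences -> {pairwise_comparisons(n):,} pairs")
```

Cutting the number of sequences tenfold cuts the number of pairs roughly a hundredfold, which is why a high ratio of unique to total sequences hurts so much at this step.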
Thanks. These were 2 x 300 bp reads, so we were hoping for a sufficiently good overlap, and we did quality filter the reads too. However, I guess there is still too much error, as seen in the relatively high ratio of unique to total sequences.
William