pca timeout

I am running the pca command on 4 different data sets, each containing 14 groups. Two data sets use v4 Illumina tag sequences and two use v6 Illumina tag sequences. For each region (v4 and v6), one data set has 10,000 sequences per group and the other has 230,000 sequences per group. When I run pca on the 10,000-sequence data sets, both v4 and v6 work well. However, at the 230,000-sequence level, only the v4 set works. The v6 run has now timed out twice, once after 5 days and once after 10 days, while the v4 set took only a few hours. This is the opposite of everything else I have run in mothur, where the v4 runs consistently take longer than the v6 runs because the v4 sequences average about 250 bp versus roughly 60 bp for the v6 tag. Is there an algorithmic reason this is occurring (more difficulty with shorter reads?), or some other problem I haven't found yet (a typo in my batch file)?

Thanks,
Zak

I would strongly discourage the use of PCA. Instead, calculate a distance matrix using dist.shared with a calculator such as Bray-Curtis or ThetaYC, and run that distance matrix through PCoA. There is an example of this on the MiSeq SOP wiki page.
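To make the distance-matrix step concrete, here is a minimal Python sketch of a ThetaYC-style distance computed between every pair of samples. It is not mothur's implementation; the OTU counts and the names `theta_yc`, `distance_matrix`, and `counts` are made up for illustration, and the formula assumed is the usual Yue and Clayton one applied to relative abundances.

```python
def theta_yc(x, y):
    """ThetaYC dissimilarity between two samples of OTU counts."""
    total_x, total_y = sum(x), sum(y)
    # Convert raw counts to relative abundances.
    a = [v / total_x for v in x]
    b = [v / total_y for v in y]
    shared = sum(p * q for p, q in zip(a, b))
    diff = sum((p - q) ** 2 for p, q in zip(a, b))
    # OTUs absent from both samples add 0 to both sums.
    return 1.0 - shared / (diff + shared)


def distance_matrix(samples):
    """Pairwise distances for a dict mapping sample name to OTU counts."""
    names = list(samples)
    return {(m, n): theta_yc(samples[m], samples[n])
            for m in names for n in names}


# Hypothetical counts: rows are samples, columns are OTUs.
counts = {
    "A": [10, 0, 5, 3],
    "B": [10, 0, 5, 3],
    "C": [0, 8, 0, 0],
}
dm = distance_matrix(counts)
for (m, n), d in sorted(dm.items()):
    print(m, n, round(d, 3))
```

A matrix like this (in whatever file format the tool expects) is what an ordination such as PCoA takes as input, in place of the raw OTU table.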

Pat

Thanks Pat. I will try that. Could you be a little more specific as to why you discourage PCA?

Zak

PCA essentially uses R² as a distance between samples, which weights double zeros the same as double ones. In other words, if an OTU is missing from both samples being compared, that shared absence inflates the similarity between them. Other metrics that are widely used in ecology (e.g. Bray-Curtis and ThetaYC) ignore these double zeros. PCA is more appropriate for comparing communities based on their metadata.
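The double-zero effect can be illustrated with a small Python sketch. It uses Pearson correlation as a stand-in for the correlation-style similarity described above (an assumption, not what the pca command computes internally), and the counts are invented. Padding both samples with OTUs that are absent from both raises the correlation but leaves Bray-Curtis unchanged.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity; OTUs with 0 in both samples add nothing."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


# Hypothetical OTU counts for two samples.
sample_a = [10, 0, 5, 3]
sample_b = [0, 8, 4, 6]

# Append 20 OTUs that are missing from both samples (double zeros).
padding = [0] * 20
padded_a = sample_a + padding
padded_b = sample_b + padding

r_original = pearson_r(sample_a, sample_b)
r_padded = pearson_r(padded_a, padded_b)
bc_original = bray_curtis(sample_a, sample_b)
bc_padded = bray_curtis(padded_a, padded_b)

print("Pearson r:   ", r_original, "->", r_padded)
print("Bray-Curtis: ", bc_original, "->", bc_padded)
```

With these made-up counts the correlation moves from negative to positive purely because of the shared absences, while the Bray-Curtis value is identical before and after padding.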

Pat

Thank you very much for that, Pat.

Zak