Using libshuff to compare pyrosequencing samples

Hello-

I have been trying to use libshuff to compare 16S libraries for different samples (pyrosequencing data). I am doing this because I do not have replicates of all samples so tests like anosim and anova are not available to me. The main problem I am encountering is twofold:

  1. I am using the “unique” sequences for each sample instead of the full library. I am concerned this may be muting my calculation of the differences between 2 libraries. Since libshuff calculates the number of sequences that are unique to each library, that number will be lower since I am only using unique sequences for each library. Is it recommended to use all sequences or just unique ones?
  2. Prior to this libshuff analysis I have removed all sequences that appear 10 or fewer times (rare) using the split.abund command. As such, my dataset contains 0 singletons. When I use summary.single to calculate Good’s coverage it is always nearly 100%. Since libshuff also calculates Good’s coverage I am concerned that this is biasing the results.

Interestingly enough, I was able to find some significant differences between samples despite the above caveats, but I am concerned that I have violated one of the fundamental assumptions of LIBSHUFF, namely that the dataset contains singletons. Can these results be trusted?

Thanks,
Kristina

Urggghhh, libshuff… I would suggest that libshuff has probably outlived its usefulness. The wonky statistic, the inability to incorporate replication, and the availability of other methods all suggest that we scrap the method. At the level of coverage given by 454 sequencing, I’d actually be surprised if most things don’t come back as being significant. Frankly, I would get more replicates and/or calculate a beta-diversity measure between your samples. Regardless…
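For the beta-diversity route, a minimal sketch of one common choice, the Bray-Curtis dissimilarity between two samples’ OTU abundance vectors (the OTU counts here are invented purely for illustration):

```python
# Minimal sketch: Bray-Curtis dissimilarity between two samples'
# OTU abundance vectors. Counts are hypothetical, not real data.

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum(|a_i - b_i|) / sum(a_i + b_i).

    0 means identical composition; 1 means no shared OTUs.
    """
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den

# Hypothetical per-OTU counts for two samples (same OTU order)
sample1 = [30, 0, 12, 5, 0]
sample2 = [10, 8, 12, 0, 4]

print(bray_curtis(sample1, sample2))
```

In practice you would get these abundance vectors from your OTU table (one row per sample) rather than typing them in, and tools like mothur or scipy already implement this calculator.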

I am using the “unique” sequences for each sample instead of the full library. I am concerned this may be muting my calculation of the differences between 2 libraries. Since libshuff calculates the number of sequences that are unique to each library, that number will be lower since I am only using unique sequences for each library. Is it recommended to use all sequences or just unique ones?

All of them.

Prior to this libshuff analysis I have removed all sequences that appear 10 or fewer times (rare) using the split.abund command. As such, my dataset contains 0 singletons. When I use summary.single to calculate Good’s coverage it is always nearly 100%. Since libshuff also calculates Good’s coverage I am concerned that this is biasing the results.

Yeah, this isn’t a good idea since you are artificially skewing the Cx coverage values to 100% regardless of the distance. I would say that this strategy, regardless of the technique, is probably not a great one. There are chimeras represented by more than 10 sequences and there are sequencing errors represented by more than 10 sequences; the converse is true as well.
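To see why the filtering forces the coverage to 100%: Good’s coverage is C = 1 - n1/N, where n1 is the number of sequences observed exactly once and N is the total number of sequences. Remove everything with abundance 10 or fewer and n1 is zero by construction, so C is exactly 1 at every distance. A toy sketch (abundances invented for illustration):

```python
# Toy illustration: Good's coverage C = 1 - (singletons / total sequences).
# The abundance lists below are invented for the example.

def goods_coverage(abundances):
    """Good's coverage estimator from a list of per-OTU abundances."""
    n1 = sum(1 for n in abundances if n == 1)  # OTUs observed exactly once
    total = sum(abundances)                    # total number of sequences
    return 1 - n1 / total

raw = [50, 20, 11, 3, 1, 1, 1]           # library that still has singletons
filtered = [n for n in raw if n > 10]    # after dropping abundance <= 10

print(goods_coverage(raw))       # < 1, because singletons remain
print(goods_coverage(filtered))  # exactly 1.0: no singletons survive the cutoff
```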

Hope this helps…
Pat