other options (e.g. median and mean) for subsampling


I read a paper evaluating sub-sampling methods: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3233110/. The authors suggested that sub-sampling to the median number of reads was more accurate than sub-sampling to the minimum.

I thought I would start a friendly discussion here about the rationale and reasoning behind the different sub-sampling methods.

I suppose I could use Python to “recode” the data, as mentioned in the paper, for median sub-sampling. Mothur would sub-sample to the minimum size. I’m more interested in the rationale.

I would appreciate any opinions.


Subsampling to the median makes zero sense to me. It would mean upsampling the samples that have fewer sequences than the median, effectively making up data. People should pick an acceptable threshold and rarefy to that number of sequences.
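
To make the point concrete, here is a rough Python sketch of rarefying to a fixed depth by drawing reads without replacement. It is only an illustration, not mothur's implementation and not the "recode" procedure from the paper. The `rarefy` helper, the sample names, the OTU labels, and the counts are all made up for the example. A sample with fewer reads than the chosen depth cannot be drawn from without inventing reads, so the sketch returns `None` for it.

```python
import random
from collections import Counter
from statistics import median


def rarefy(counts, depth, seed=0):
    """Draw `depth` reads without replacement from one sample.

    counts: dict mapping OTU label -> read count (hypothetical input format).
    Returns a dict of subsampled counts, or None when the sample has fewer
    than `depth` reads and so cannot be rarefied without making up data.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand the counts into one label per read, then sample without replacement.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(reads, depth)
    return dict(Counter(drawn))


# Made-up example data: three samples with very different sequencing depths.
samples = {
    "A": {"otu1": 50, "otu2": 30, "otu3": 20},     # 100 reads
    "B": {"otu1": 400, "otu2": 100},               # 500 reads
    "C": {"otu1": 600, "otu2": 300, "otu3": 100},  # 1000 reads
}

totals = {name: sum(c.values()) for name, c in samples.items()}
min_depth = min(totals.values())
median_depth = int(median(totals.values()))

# Rarefying to the minimum keeps every sample.
at_min = {name: rarefy(c, min_depth) for name, c in samples.items()}

# Rarefying to the median leaves the shallow samples with nothing to draw from.
at_median = {name: rarefy(c, median_depth) for name, c in samples.items()}

for name in samples:
    status = "kept" if at_median[name] is not None else "too few reads"
    print(name, totals[name], status)
```

With these numbers the median depth is 500, so sample A (100 reads) would have to be either dropped or padded with reads that were never sequenced. Picking a threshold and discarding samples below it is the same operation as the sketch above with a hand-chosen `depth`.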