Random sampling from data quantiles, while preserving original probability distribution

僤鯓⒐⒋嵵緔 submitted on 2019-12-25 14:22:37

Question


Following my previous question, "Random sampling from a dataset, while preserving original probability distribution", I want to sample from a set of more than 2000 numbers gathered from measurement. I want to perform several tests (taking at most 10 samples in each test) while preserving the probability distribution over the whole testing process and, as far as possible, within each test. Instead of completely random sampling, I now partition the data into 5 quantiles, and in each of 10 tests I sample 2 data elements from each quantile, choosing uniformly at random from the array of data in that quantile.

The problem with completely random sampling was that, because the distribution of the data is long-tailed, I was getting almost the same values in each test. I want some small values, some middle values, and some large values in each test, so I sampled as described.

Fig 1. Density plot of ~2k elements of data.

This is the R code for calculating quantiles:

# probs in steps of 0.2 give the 6 breakpoints that bound 5 equal-probability groups
q <- quantile(data, probs = seq(0, 1, by = 0.2))

I then partition the data into 5 quantiles (each one stored as an array) and sample from each partition. For example, I do this in Java:

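import java.util.Random;

public int getRandomData(int quantile) {
    // Toy data: one row per quantile (the real rows hold the measured values).
    int[][] data = {
        {1, 2, 3, 4, 5},
        {6, 7, 8, 9, 10},
        {11, 12, 13, 14, 15},
        {16, 17, 18, 19, 20},
        {21, 22, 23, 24, 25}
    };
    int length = data[quantile].length;
    Random r = new Random();
    int randomInt = r.nextInt(length); // uniform index in [0, length)
    return data[quantile][randomInt];
}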
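
The per-quantile arrays could also be built from the raw measurements instead of being typed in by hand. The following is only a minimal sketch under assumptions: the class name QuantileSampler and its methods partition and sample are hypothetical, the groups are formed by sorting the data and cutting it into equal-count blocks (which only approximates the breakpoints R's quantile() would give), and values are drawn without replacement within each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of the original post.
class QuantileSampler {

    // Sort a copy of the data and cut it into `groups` contiguous blocks of
    // (almost) equal size, so block i covers the i-th quantile range.
    static List<double[]> partition(double[] data, int groups) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        List<double[]> parts = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            int from = (int) ((long) g * sorted.length / groups);
            int to = (int) ((long) (g + 1) * sorted.length / groups);
            parts.add(Arrays.copyOfRange(sorted, from, to));
        }
        return parts;
    }

    // Draw `perGroup` elements from distinct positions of every block,
    // uniformly at random, ordered from the lowest block to the highest.
    static double[] sample(List<double[]> parts, int perGroup, Random rng) {
        double[] out = new double[parts.size() * perGroup];
        int k = 0;
        for (double[] part : parts) {
            if (part.length < perGroup) {
                throw new IllegalArgumentException("group smaller than perGroup");
            }
            List<Double> pool = new ArrayList<>();
            for (double v : part) pool.add(v);
            Collections.shuffle(pool, rng); // random order, then take a prefix
            for (int i = 0; i < perGroup; i++) out[k++] = pool.get(i);
        }
        return out;
    }
}
```

With 5 groups and perGroup = 2, each call to sample returns the 10 values for one test.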

So, do the samples in each test, and in all tests overall, preserve the characteristics of the original distribution, for example the mean and variance? If not, how should I arrange the sampling to achieve this goal?


Answer 1:


preserve the characteristics of the original distribution, for example mean and variance?

This will give a similar distribution. You might want to add a check that each sample meets your requirement, and perhaps draw again if it does not, but this will get you close.
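
The "check and try again" idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class SampleCheck, the relative tolerance relTol, the retry limit maxTries, and the choice of the population variance (dividing by n) are not from the original answer. The sampler is passed in as a Supplier, so any sampling scheme can be plugged in.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of the original answer.
class SampleCheck {

    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Population variance (divides by n); an assumption, use n - 1 if preferred.
    static double variance(double[] x) {
        double m = mean(x), ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / x.length;
    }

    // True when |a - b| is at most relTol times |b|.
    static boolean close(double a, double b, double relTol) {
        return Math.abs(a - b) <= relTol * Math.abs(b);
    }

    // Redraw until the sample's mean and variance are both within relTol of
    // the full data's values; return null if maxTries draws all fail.
    static double[] drawUntilClose(double[] data, Supplier<double[]> draw,
                                   double relTol, int maxTries) {
        double m = mean(data), v = variance(data);
        for (int i = 0; i < maxTries; i++) {
            double[] s = draw.get();
            if (close(mean(s), m, relTol) && close(variance(s), v, relTol)) {
                return s;
            }
        }
        return null;
    }
}
```

Note that discarding samples that fail the check means the accepted samples are no longer plain random draws, so a loose tolerance is probably the safer choice.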

If not, how to arrange sampling to achieve this goal?

Unless you duplicate all of the data (i.e. double everything), you need to have exactly one of every sample value. This is the only way to get exactly the same distribution.



Source: https://stackoverflow.com/questions/32550059/random-sampling-from-data-quantiles-while-preserving-original-probability-distr
