Can we conclude a set might not be random by checking its subset?

Question

Set A includes 1000 numbers. I checked that half of the numbers in this set are even.

I extracted subset B from set A as follows: any number in set A that starts with 1 is also in set B (so all numbers in B start with 1).

I checked that more than half of the numbers in set B are even.

Half of the numbers in A are even, so should we expect the same for B? But more than half of B are even. Can we therefore conclude that set A is not random?

If 60% of B are even, can we still conclude that A was not generated randomly?

What if 70% of B are odd?

Answer 1:

That depends entirely on how large the sample is.

From basic probability, if p is the probability of getting a "success" (the outcome you're focused on) from a binary trial, then q = 1 - p is the probability of getting a "failure" (the alternative outcome). Let n be the number of trials. If the trials are independent, the number of successes X has a binomial distribution with parameters n and p, and p-hat = X/n is an unbiased estimator for p. The mean and variance of p-hat are p and pq/n, respectively, and for sufficiently large sample sizes the distribution of p-hat is approximately Gaussian (the bell-shaped curve). Based on that, as long as p and q are not too close to 0, we can say that in repeated experiments about 95% of the estimates should fall within a distance of 1.96*sqrt(pq/n) of the true mean. That distance is called the margin of error (ME).

You're conjecturing that p = 1/2, so pq = 1/4. Consequently, your margin of error is ME = 1.96*sqrt(pq/n) = 0.98/sqrt(n). You can invert this to find the sample size needed to obtain a particular ME: n = ceiling((0.98/ME)²).
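As a quick numerical check of the formula above, here is a minimal Python sketch. The function name `margin_of_error` and the sample sizes in the loop are illustrative choices, not part of the original answer.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Uses the normal approximation ME = z * sqrt(p*q/n) described above.
    """
    q = 1.0 - p
    return z * math.sqrt(p * q / n)

for n in (25, 100, 1000):
    # With p = 1/2 the general formula reduces to 0.98 / sqrt(n).
    print(n, round(margin_of_error(n), 4), round(0.98 / math.sqrt(n), 4))
```

Both printed columns should agree for every n, since sqrt(1/4) = 1/2 and 1.96/2 = 0.98.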

Plugging in some particular margins of error:

ME = 0.20 ==> n = 25 (borderline to believe Gaussian convergence)
ME = 0.10 ==> n = 97
ME = 0.05 ==> n = 385
ME = 0.03 ==> n = 1068
ME = 0.01 ==> n = 9604
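The table above can be reproduced with a short Python sketch. The helper name `required_sample_size` is hypothetical, and exact fractions are used here (an implementation choice, not something the answer prescribes) so that the ceiling is not thrown off by floating-point rounding.

```python
import math
from fractions import Fraction

def required_sample_size(me, p="0.5", z="1.96"):
    """Smallest n with z*sqrt(p*q/n) <= me, i.e. n = ceiling(z^2 * p*q / me^2).

    Arguments are decimal strings so they convert to exact fractions.
    """
    me, p, z = Fraction(me), Fraction(p), Fraction(z)
    return math.ceil(z * z * p * (1 - p) / (me * me))

for me in ("0.20", "0.10", "0.05", "0.03", "0.01"):
    print(f"ME = {me} ==> n = {required_sample_size(me)}")
```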

In other words, the smaller you want your margin of error to be, the larger the sample size required, and the sampling requirement grows quadratically.

Those last two are relevant to political polling. It's common to take sample sizes around 1000 and report the estimates as having a margin of error of ±3%. People would intuitively like ±1%, but it would take 9 times the sampling and is deemed to not be cost-effective.

Bringing this back around to your question, based on the size of your subset you can make a probabilistic statement about how plausible you find your conjecture that p = 1/2, but it's going to take hundreds or thousands of values to make that a tight bound.
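To make that concrete, here is a rough Python sketch that checks whether an observed fraction of even numbers lies within the 95% margin of error around p = 1/2. The question does not say how large B is, so the subset sizes below (50 and 300) are purely hypothetical, as is the function name `consistent_with_half`.

```python
import math

def consistent_with_half(n, evens, z=1.96):
    """Check an observed proportion against p = 1/2 using the normal approximation.

    Returns (within_margin, p_hat, margin_of_error).
    """
    p_hat = evens / n
    me = z * math.sqrt(0.25 / n)  # p*q = 1/4 when p = 1/2
    return abs(p_hat - 0.5) <= me, p_hat, me

# Hypothetical subset sizes; the real size of B is not given in the question.
for n, evens in ((50, 30), (300, 180), (50, 15)):
    ok, p_hat, me = consistent_with_half(n, evens)
    verdict = "consistent with" if ok else "outside the margin for"
    print(f"n={n}, even fraction={p_hat:.2f}, ME={me:.3f}: {verdict} p = 1/2")
```

Under these assumed sizes, 60% even in a subset of 50 is still within the margin, while the same 60% in a subset of 300 is not, and 70% odd (30% even) falls outside the margin even at 50. As the answer notes, such a check only addresses uniformity of the last digit's parity, not randomness in general.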

Also, please note that non-uniform or non-independent are not the same thing as non-random. The test you're trying to perform is for uniformity of select bits, and tells you nothing about the other bits nor about the independence of the data.

Source: https://stackoverflow.com/questions/43083812/can-we-conclude-a-set-might-not-be-random-by-checking-its-subset

Tags

testing

random

numbers

set

subset