A Good and SIMPLE Measure of Randomness

情话喂你 2020-12-14 00:17

What is the best algorithm to take a long sequence of integers (say 100,000 of them) and return a measurement of how random the sequence is?

The function should return a number between 0 and 1, with 1 meaning highly random and 0 meaning not random at all.

13 Answers
  • 2020-12-14 00:30

    Your question answers itself. "If I were to pass the first 100,000 digits of Pi to the function, it should give a number very close to 1", yet the digits of Pi are not random numbers, so if your algorithm does not recognise a very specific sequence like that as non-random, it's not very good.

    The problem here is that there are many types of non-randomness: e.g. "121,351,991,7898651,12398469018461" or "33,27,99,3000,63,231" or even "14297141600464,14344872783104,819534228736,3490442496" are definitely not random.

    I think what you need to do is identify the aspects of randomness that are important to you: distribution of values, distribution of digits, lack of common factors, the expected number of primes, Fibonacci and other "special" numbers, etc.

    P.S. A quick and dirty (and very effective) test of randomness: does the file end up roughly the same size after you gzip it?
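    A minimal sketch of that gzip test in Python (the function name and the "near 1.0" interpretation are illustrative, not canonical):

    ```python
    import gzip
    import os

    def compression_ratio(data: bytes) -> float:
        """Ratio of gzip-compressed size to original size.
        Near 1.0 suggests incompressible, i.e. random-looking data;
        much smaller values indicate structure or repetition."""
        return len(gzip.compress(data)) / len(data)

    random_bytes = os.urandom(100_000)                  # OS randomness
    patterned = bytes(i % 10 for i in range(100_000))   # repeating pattern

    # random data barely compresses; the pattern compresses dramatically
    assert compression_ratio(random_bytes) > compression_ratio(patterned)
    ```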

  • 2020-12-14 00:35

    You can treat your 100,000 outputs as possible outcomes of a random variable and calculate its associated entropy, which gives you a measure of uncertainty (see the Wikipedia article on entropy for more). Simply:

        H(X) = -Σᵢ p(xᵢ) log₂ p(xᵢ)

    You just need to calculate the frequency of each number in the sequence. That gives you p(xᵢ) (e.g. if 10 appears 27 times, p(10) = 27/L, where L is 100,000 in your case). This gives you the entropy measure.

    It will not, however, give you a number between 0 and 1. A value of 0 is still minimal uncertainty, but the upper bound is log₂ of the number of possible outcomes rather than 1, so you need to normalize the output to achieve that.
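    A small sketch of this calculation, normalizing by log₂ of the number of distinct symbols actually observed (an assumption for illustration; if you know the full alphabet size, normalize by log₂ of that instead):

    ```python
    import math
    from collections import Counter

    def normalized_entropy(seq) -> float:
        """Shannon entropy of the empirical distribution of seq,
        divided by log2(k), k = distinct symbols seen, so the
        result lies in [0, 1]."""
        counts = Counter(seq)
        total = len(seq)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        k = len(counts)
        return h / math.log2(k) if k > 1 else 0.0

    # a perfectly uniform sequence scores 1.0; a constant one scores 0.0
    assert abs(normalized_entropy([0, 1] * 500) - 1.0) < 1e-9
    assert normalized_entropy([7, 7, 7, 7]) == 0.0
    ```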

  • 2020-12-14 00:35

    "How random is this sequence?" is a tough question because fundamentally you're interested in how the sequence was generated. As others have said it's entirely possible to generate sequences that appear random, but don't come from sources that we'd consider random (e.g. digits of pi).

    Most randomness tests seek to answer a slightly different question: "Is this sequence anomalous with respect to a given model?" If your model is rolling a ten-sided die, then it's pretty easy to quantify how likely a sequence is to have been generated from that model, and the digits of pi would not look anomalous. But if your model is "can this sequence be easily generated by an algorithm?", it becomes much more difficult.

  • 2020-12-14 00:37

    In computer vision, when analysing textures, the problem of gauging the randomness of a texture comes up in order to segment it. This is exactly the same as your question, because you are trying to determine the randomness of a sequence of bytes/integers/floats. The best discussion I could find of image entropy is http://www.physicsforums.com/showthread.php?t=274518 .

    Basically, it's the statistical measure of randomness for a sequence of values.

    I would also try autocorrelation of the sequence with itself. If there are no peaks in the autocorrelation result other than the first value, that means there is no periodicity in your input.
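    A rough sketch of that autocorrelation check with NumPy (the periodic test signal is illustrative):

    ```python
    import numpy as np

    def autocorrelation(x):
        """Normalized autocorrelation of x with itself.
        result[0] == 1 by construction; strong peaks at other
        lags indicate periodicity in the input."""
        x = np.asarray(x, dtype=float)
        x = x - x.mean()                     # remove the DC component
        acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
        return acf / acf[0]

    # a period-4 signal shows a strong autocorrelation peak at lag 4
    acf = autocorrelation(np.tile([1, 2, 3, 4], 250))
    assert acf[4] > 0.9
    ```

    For a genuinely aperiodic input, all lags past 0 stay close to zero.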

  • 2020-12-14 00:40

    Measuring randomness? To do so, you should first understand what it means. The problem is that if you search the internet you will conclude there is no single, uniform concept of randomness. Randomness and uncertainty are two different things, and one of the most common misleading ideas is testing whether something "is random or not random". Randomness is not an absolute concept but a relative one: it is not a yes/no question, so you cannot determine that something is or is not random, only its degree of randomness. To say something is random or not random in an absolute way would be wrong, because it is relative and even subjective; in the same way, it is subjective and relative to say that something follows a pattern or doesn't, because what's a pattern?

    To measure randomness, start from the underlying mathematical premise, which is easy to understand and accept: randomness is achieved to its fullest extent when all possible outcomes have exactly the same probability of happening. It's that simple. What is more difficult is linking this premise to a particular sequence or distribution of outcomes in order to determine a degree of randomness. You could divide your sample into subsets that each prove to be relatively random, yet the sample could still be shown to be not that random when analyzed as a whole. So, to analyze the degree of randomness, consider the sample as a whole, not subdivided. There are no 7 tests or 5 tests, as referred to here; there is only one.

    That test follows the premise above and determines the degree of randomness from the outcome frequency distribution type of a given sample; the specific sequence of the sample is not relevant. If you consider the variables n (possible outcomes) and t (number of trials, i.e. elements in your sequence), the total number of possible sequences is n^t. Under the premise, every one of these sequences has exactly the same probability of occurring, so any specific sequence is inconclusive for calculating the "randomness" of the sample. What is essential is the probability of the sample's outcome distribution type. If s is the number of sequences that share your sample's outcome distribution type, then s/(n^t) gives a value between 0 and 1 which can be interpreted as a measure of randomness for that sample, with 1 being 100% random and 0 being 0% random. You will never actually get 1 or 0: even if your sample coincides with the MOST likely distribution type it can never be proven 100% random, and if it coincides with the LEAST likely type it can never be proven 0% random, because with several distribution types no single one accounts for all or none of the probability. To compute s, use the same logic as multinomial distribution probabilities. This applies to any number of possible outcomes and any sequence length.

    Of course, if your sample or sequence is big, an enormous amount of calculation will be required.
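    A rough sketch of this idea: the number of sequences sharing a sample's frequency distribution is the multinomial coefficient t!/(c₁!·…·cₙ!), so s/(n^t) can be computed directly (the function name is illustrative, and this brute-force form is only practical for small samples):

    ```python
    from math import factorial
    from collections import Counter

    def type_probability(seq, n_outcomes: int) -> float:
        """Probability, under a uniform model over n_outcomes symbols,
        of observing the frequency distribution (outcome type) of seq:
        s / n^t, where s is the multinomial coefficient."""
        t = len(seq)
        s = factorial(t)
        for count in Counter(seq).values():
            s //= factorial(count)          # t! / (c1! * c2! * ...)
        return s / n_outcomes**t

    # two coin flips: the "one of each" type covers 2 of 4 sequences
    assert type_probability([0, 1], 2) == 0.5
    # the "two heads" type covers only 1 of 4 sequences
    assert type_probability([0, 0], 2) == 0.25
    ```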

  • 2020-12-14 00:41

    I want to emphasize here that the word "random" means not only uniformly distributed but also independent of everything else (including independent of any other choice).

    There are numerous "randomness tests" available, including tests that estimate p-values from running various statistical probes, as well as tests that estimate min-entropy, which is roughly a minimum "compressibility" level of a bit sequence and the most relevant entropy measure for "secure random number generators". There are also various "randomness extractors", such as the von Neumann and Peres extractors, that could give you an idea on how much "randomness" you can extract from a bit sequence. However, all these tests and methods can only be more reliable on the first part of this definition of randomness ("uniformly distributed") than on the second part ("independent").
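    For illustration, the von Neumann extractor mentioned above can be sketched in a few lines (this is the classic pairwise scheme for independent but possibly biased bits, not a production implementation):

    ```python
    def von_neumann_extract(bits):
        """Von Neumann extractor: read non-overlapping pairs of bits;
        emit 0 for (0,1), 1 for (1,0), and discard (0,0) and (1,1).
        If the input bits are independent with a fixed bias, the
        output bits are unbiased."""
        out = []
        for i in range(0, len(bits) - 1, 2):
            a, b = bits[i], bits[i + 1]
            if a != b:
                out.append(a)   # (0,1) -> 0, (1,0) -> 1
        return out

    assert von_neumann_extract([0, 1, 1, 0, 0, 0, 1, 1]) == [0, 1]
    ```

    Note that the guarantee only holds if the input bits really are independent; the extractor cannot repair correlated input, which mirrors the point below about "independent" being the hard part to test.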

    In general, there is no algorithm that can tell, from a sequence of numbers alone, whether the process that generated them did so in an independent and uniformly distributed way, without knowledge of what that process is. Thus, for example, although you can tell that a given sequence of bits has more zeros than ones, you can't tell whether those bits:

    • were truly generated independently of any other choice, or
    • form part of an extremely long periodic sequence that is only "locally random", or
    • were simply reused from another process, or
    • were produced in some other way,

    ...without more information on the process. As one important example, the process of a person choosing a password is rarely "random" in this sense since passwords tend to contain familiar words or names, among other reasons.

    Also I should discuss the article added to your question in 2019. That article dealt with the task of sampling pseudorandom quantum circuits with a low rate of error (a task specifically designed to be exponentially easier for quantum computers than for classical computers), rather than the task of "verifying" whether a particular sequence of bits (taken out of its context) was generated "at random" in the sense given in this answer. There is an explanation on what exactly this "task" is in a July 2020 paper.
