What is the best algorithm to take a long sequence of integers (say 100,000 of them) and return a measurement of how random the sequence is?
The function should return a single value measuring how random the sequence is.
Measuring randomness? To do so, you first have to understand what randomness means. The problem is that if you search the internet you will find there is no single, uniform concept of randomness: for some people it means one thing, for others something else. Randomness and uncertainty are also two different things. One of the most common misconceptions is testing whether something "is random or not random". Randomness is not an absolute concept, it's a relative one. It's not a yes-or-no question, so you cannot determine that something is or is not random; what you can determine is its degree of randomness. Declaring something absolutely random or non-random would be wrong, because the judgment is relative and even subjective. In the same way, it is subjective and relative to say that something follows a pattern or doesn't, because what counts as a pattern?

To measure randomness, start from its mathematical premise, which is easy to understand and accept: if all possible outcomes have the EXACT same probability of occurring, then randomness is achieved to its fullest extent. It's that simple. What is harder is linking this premise to a specific sequence, or to a distribution of outcomes, in order to determine a degree of randomness.

You could divide your sample into subsets that each prove to be relatively random, and yet the sample could turn out to be not so random when analyzed as a whole. So, to analyze the degree of randomness, you should consider the sample as a whole, not subdivided. There are not seven tests or five tests, as suggested elsewhere in this thread; there is only one. That test follows the premise above and determines the degree of randomness from the outcome distribution type of the sample, in other words the outcome frequency distribution type.

The specific sequence of a sample is not relevant. With n possible outcomes and t trials (the number of elements in your sequence), there are n^t possible sequences in total. If the premise holds, every one of these sequences has exactly the same probability of occurring, so any specific sequence is inconclusive for calculating the randomness of the sample. What matters is the probability of the sample's outcome distribution type. To compute it, count all the sequences that produce that distribution type. If s = the number of possible sequences that lead to the sample's outcome distribution type, then s/(n^t) gives you a value between 0 and 1 that can be interpreted as a measurement of randomness for that specific sample, where 1 is 100% random and 0 is 0% random.

Note that you will never actually get a 1 or a 0. Even if your sample coincides with the MOST likely outcome distribution type, it can never be proven to be 100% random, and even if it coincides with the LEAST likely one, it can never be proven to be 0% random. Since there are several outcome distribution types, no single one of them can account for 100% or 0% of the sequences.
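A small worked example of the score: take coin flips, so n = 2, and a sequence of t = 4 flips, giving 2^4 = 16 possible sequences. The sample HHTT has the count distribution (2 heads, 2 tails), and s = 4!/(2!·2!) = 6 sequences share it, so its score is 6/16 = 0.375. The sample HHHH has the distribution (4 heads, 0 tails), which only s = 1 sequence produces, so its score is 1/16 = 0.0625: a much lower degree of randomness, as expected.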
To determine the value of s, use the same logic behind multinomial distribution probabilities. This applies to any number of possible outcomes and to any length of sequence. Of course, if your sample or sequence is big, an enormous number of calculations will be required; a sketch of one way to handle that is below.
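For concreteness, here is a minimal Python sketch of this score. It assumes the "outcome distribution type" means the exact vector of outcome counts, so s is the multinomial coefficient t!/(c1!·c2!·…·cn!); the function name randomness_score and its signature are mine, chosen for illustration. The factorials are evaluated in log space via lgamma so a sequence of 100,000 elements doesn't overflow.

```python
import random
from collections import Counter
from math import exp, lgamma, log

def randomness_score(sample, n):
    """Score a sequence against a uniform source with n possible outcomes.

    Returns s / n**t, where s is the number of sequences sharing the
    sample's outcome-count distribution (the multinomial coefficient)
    and n**t is the total number of possible sequences.
    """
    t = len(sample)
    counts = Counter(sample)
    # log of s = t! / (c1! * c2! * ... * cn!), using lgamma(k + 1) = log(k!)
    log_s = lgamma(t + 1) - sum(lgamma(c + 1) for c in counts.values())
    # Divide by n^t in log space, then exponentiate.
    return exp(log_s - t * log(n))

# Example: 100,000 rolls of a fair six-sided die.
rolls = [random.randrange(6) for _ in range(100_000)]
print(randomness_score(rolls, 6))
```

One caveat: when t is large, even the most probable count distribution has a small absolute probability, so in practice you may prefer to work with the log of the score (log_s - t * log(n)) and compare it against the log score of the most likely distribution rather than reading the raw value directly.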