Random integers in C, how bad is rand()%N compared to integer arithmetic? What are its flaws?


Question


EDIT: My question is: rand()%N is considered very bad, whereas the use of integer arithmetic is considered superior, but I cannot see the difference between the two.

People always mention:

  • low bits are not random in rand()%N,

  • rand()%N is very predictable,

  • you can use it for games but not for cryptography

Can someone explain whether any of these points actually apply here, and how I could see that?

If the lower bits really were less random, that should make the permutation entropy (PE) of the two cases I show below differ, but it doesn't.

I guess many like me would always avoid using rand(), or rand()%N, because we've always been taught that it is pretty bad. I was curious to see how "wrong" random integers generated with C's rand()%N actually are. This is also a follow-up to Ryan Reich's answer in How to generate a random integer number from within a range.

The explanation there sounds very convincing, to be honest; nevertheless, I thought I'd give it a try. So, I compared the distributions in a VERY naive way: I ran both random generators for different numbers of samples and domains. I didn't see the point of computing a density instead of a histogram, so I just computed histograms and, just by looking, I would say they both look equally uniform. Regarding the other point that was raised, the actual randomness (despite being uniformly distributed): I, again naively, computed the permutation entropy for these runs, and it is the same for both sample sets, which tells us that there's no difference between the two regarding the ordering of the occurrences.
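For reference, and to make precise what the Matlab code below computes: the normalized permutation entropy of order n is the Shannon entropy of the relative frequencies p(π) of the n! ordinal patterns π observed in consecutive length-n windows of the sequence,

    H_n = -\frac{1}{\log_2 n!} \sum_{\pi} p(\pi)\,\log_2 p(\pi),

which is close to 1 when all orderings occur about equally often and drops when some orderings are over-represented.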

So, for many purposes it seems to me that rand()%N would be just fine; how can we actually see its flaws?

Here I show you a very simple, inefficient and not very elegant (but, I think, correct) way of computing these samples and getting the histograms together with the permutation entropies. I show plots for domains (0,i) with i in {5,10,25,50,100} for different numbers of samples:

There's not much to see in the code, I guess, so I will just leave both the C and the Matlab code here for replication purposes.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

/* Usage: ./randomTest.exe <method> <max> <samples>
 *   method 1: rand() % (max+1)
 *   method 2: integer arithmetic (equal-sized bins, rejecting the "defect") */
int main(int argc, char *argv[]){
        if(argc < 4){
                fprintf(stderr, "usage: %s method max samples\n", argv[0]);
                return 1;
        }
        unsigned long max = strtoul(argv[2], NULL, 10);
        int samples = atoi(argv[3]);
        srand(time(NULL));
        if(atoi(argv[1]) == 1){
                /* Naive method: just take the remainder. */
                for(int i = 0; i < samples; ++i)
                        printf("%lu\n", rand() % (max + 1));
        }else{
                /* Split the rand() range into num_bins equal bins and throw
                 * away the leftover "defect" values at the top of the range. */
                for(int i = 0; i < samples; ++i){
                        unsigned long
                        num_bins = (unsigned long) max + 1,
                        num_rand = (unsigned long) RAND_MAX + 1,
                        bin_size = num_rand / num_bins,
                        defect   = num_rand % num_bins;

                        long x;
                        do {
                                x = rand();
                        } while (num_rand - defect <= (unsigned long) x);
                        printf("%lu\n", (unsigned long) x / bin_size);
                }
        }
        return 0;
}

And here is the Matlab code to plot this and compute the PEs (I took the recursion for generating the permutations, myperms, from: https://www.mathworks.com/matlabcentral/answers/308255-how-to-generate-all-possible-permutations-without-using-the-function-perms-randperm):

system('gcc randomTest.c -o randomTest.exe;');
max = 100;                           % upper end of the domain (note: shadows MATLAB's built-in max)
samples = max*10000;
trials = 200;                        % not used below
% Generate one sample file per method: 1 = rand()%(max+1), 2 = integer arithmetic.
system(['./randomTest.exe 1 ' num2str(max) ' ' num2str(samples) ' > file1'])
system(['./randomTest.exe 2 ' num2str(max) ' ' num2str(samples) ' > file2'])
a1 = load('file1');
a2 = load('file2');
uni = figure(1);
title(['Samples: ' num2str(samples)])
subplot(1,3,1)
h1 = histogram(a1,max+1);
title('rand%(max+1)')
subplot(1,3,2)
h2 = histogram(a2,max+1);
title('Integer arithmetic')
as = [a1,a2];
ns = 3:8;                            % ordinal-pattern (sequence) lengths for the PE
H = nan(ns(end),size(as,2));         % one row per pattern length, one column per method
for op = 1:size(as,2)
    x = as(:,op);
    for n = ns
        sequenceOcurrence = zeros(1,factorial(n));
        sequences = myperms(1:n);    % all n! possible ordinal patterns
        % Encode each pattern as a decimal integer so windows can be matched to it.
        sequencesArrayIdx = sum(sequences.*10.^(size(sequences,2)-1:-1:0),2);
        for i = 1:numel(x)-n+1       % slide over every length-n window
            [~,sequenceOrder] = sort(x(i:i+n-1));
            out = sequenceOrder'*10.^(numel(sequenceOrder)-1:-1:0).';
            sequenceOcurrence(sequencesArrayIdx == out) = sequenceOcurrence(sequencesArrayIdx == out) + 1;
        end
        chunks = length(x) - n + 1;
        ps = sequenceOcurrence/chunks;                      % relative frequency of each pattern
        hh = -sum(ps(logical(ps)).*log2(ps(logical(ps))));  % Shannon entropy of the patterns
        H(n,op) = hh/log2(factorial(n));                    % normalize by the maximum, log2(n!)
    end
end
subplot(1,3,3)
plot(ns,H(ns,:),'--*','linewidth',2)
ylabel('PE')
xlabel('Sequence length')
filename = ['all_' num2str(max) '_' num2str(samples)];
export_fig(filename)

Answer 1:


Due to the way modulo arithmetic works, if N is significant compared to RAND_MAX, doing %N will make you considerably more likely to get some values than others. Imagine rand() returns the 12 values 0 through 11 (i.e. RAND_MAX is 11) and N is 9. If the underlying distribution is good, then the chance of getting one of 0, 1 or 2 is 0.5, and the chance of getting one of 3, 4, 5, 6, 7, 8 is also 0.5. The result is that you're twice as likely to get a 0 as a 4. If N exactly divides RAND_MAX + 1, this distribution problem doesn't happen, and if N is very small compared to RAND_MAX the issue becomes less noticeable. RAND_MAX may not be a particularly large value (it may be as small as 2^15 - 1 = 32767), which makes the problem worse than you might expect.

The alternative of doing (rand() * N) / (RAND_MAX + 1) also doesn't give an even distribution; however, it is every mth value (for some m) that is more likely to occur, rather than the more likely values all sitting at the low end of the distribution.

If N is 75% of RAND_MAX + 1, then the values in the bottom third of your distribution are twice as likely as the values in the top two thirds (as this is where the extra values map to).
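To check these numbers without relying on any particular rand() implementation, here is a small sketch of my own (not from the answer) that simply enumerates every raw value an ideal generator with a given maximum could return and counts how many of them land on each output of % N:

#include <stdio.h>
#include <stdlib.h>

/* Enumerate all raw values 0..raw_max once (each is equally likely for an
 * ideal generator) and count how many of them land on each output of % N. */
static void count_mod_bias(unsigned long raw_max, unsigned long N){
    unsigned long *hits = calloc(N, sizeof *hits);
    if (!hits) return;
    for (unsigned long r = 0; r <= raw_max; ++r)
        hits[r % N]++;
    printf("raw_max = %lu, N = %lu: output 0 hit %lu times, output %lu hit %lu times\n",
           raw_max, N, hits[0], N - 1, hits[N - 1]);
    free(hits);
}

int main(void){
    count_mod_bias(11, 9);          /* the toy example above: 2 vs 1 hits          */
    count_mod_bias(32767, 24576);   /* N = 75% of 32768: bottom third 2 vs 1 hits  */
    count_mod_bias(32767, 10);      /* N tiny compared to RAND_MAX: 3277 vs 3276   */
    return 0;
}

The first call reproduces the 2:1 ratio from the toy example above, the second the "bottom third twice as likely" case, and the third shows why the bias becomes negligible when N is tiny compared to RAND_MAX (3277 vs 3276 hits).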

The quality of rand() will depend on the implementation on the system that you're on. I believe some systems have had very poor implementations; OS X's man pages declare rand() obsolete. The Debian man page says the following:

The versions of rand() and srand() in the Linux C Library use the same random number generator as random(3) and srandom(3), so the lower-order bits should be as random as the higher-order bits. However, on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher-order bits. Do not use this function in applications intended to be portable when good randomness is needed. (Use random(3) instead.)




Answer 2:


Both approaches have their pitfalls, and your graphs are little more than a pretty verification of the central limit theorem! For a sensible implementation of rand():

  1. % N suffers from a "pigeon-holing" effect if 1u + RAND_MAX is not a multiple of N

  2. /((RAND_MAX + 1u)/N) does not, in general, evenly distribute the return of rand across your range, due to integer truncation effects.

On balance, if N is small compared to RAND_MAX, I'd plump for % for its tractability. In any case, test your generator to see if it has the appropriate statistical properties for your application.
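To make both pitfalls concrete, here is a minimal sketch (my own illustration, using a deliberately tiny hypothetical generator range) that counts where each mapping sends every possible raw value:

#include <stdio.h>

/* A deliberately tiny, hypothetical generator range so the bias is visible:
 * raw values 0..11 (12 equally likely outcomes), mapped into the range 0..8. */
#define FAKE_RAND_MAX 11u
#define N 9u

int main(void){
    unsigned bin_size = (FAKE_RAND_MAX + 1u) / N;    /* integer truncation: here 1 */
    unsigned mod_count[N] = {0};
    unsigned div_count[FAKE_RAND_MAX + 1u] = {0};    /* wide enough for out-of-range bins */

    for (unsigned r = 0; r <= FAKE_RAND_MAX; ++r) {
        ++mod_count[r % N];          /* pitfall 1: pigeon-holing                      */
        ++div_count[r / bin_size];   /* pitfall 2: division alone spills past N-1     */
    }

    puts("output : hits via %N : hits via /bin_size");
    for (unsigned v = 0; v <= FAKE_RAND_MAX / bin_size; ++v)
        printf("%6u : %12u : %19u%s\n",
               v, v < N ? mod_count[v] : 0, div_count[v],
               v < N ? "" : "  <-- outside 0..N-1, must be rejected");
    return 0;
}

With %N the low outputs are hit twice as often; with plain /bin_size every in-range output is hit equally often, but some raw values spill past N-1, which is exactly why the question's second method rejects the "defect" values before dividing.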




Answer 3:


rand() % N is considered extremely poor not because the distribution is bad, but because the randomness is poor-to-nonexistent. (If anything the distribution will be too good.)

If N is not small with respect to RAND_MAX, both

rand() % N

and

rand() / (RAND_MAX / N + 1)

will have more or less the same, poor distribution -- certain values will occur with significantly higher probability than others.

Looking at distribution histograms won't show you that for some implementations, rand() % N has a much, much worse problem -- to show that you'd have to perform some correlations with previous values. (For example, try taking rand() % 2, then subtracting from the previous value you got, and plotting a histogram of the differences. If the difference is never 0, you've got a problem.)
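That check is easy to sketch; assuming nothing beyond the standard rand()/srand() interface (this is my illustration, not code from the answer):

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

/* Histogram the differences between consecutive low bits of rand().
 * For a decent generator, diffs of -1, 0 and +1 should all occur
 * (roughly 25%, 50%, 25%); if 0 never shows up, the low bit simply
 * alternates and rand()%2 is not random at all. */
int main(void){
    enum { SAMPLES = 1000000 };
    long diff_count[3] = {0};        /* index 0: diff -1, 1: diff 0, 2: diff +1 */
    srand((unsigned) time(NULL));

    int prev = rand() % 2;
    for (long i = 0; i < SAMPLES; ++i) {
        int cur = rand() % 2;
        diff_count[(cur - prev) + 1]++;
        prev = cur;
    }
    printf("diff -1: %ld\ndiff  0: %ld\ndiff +1: %ld\n",
           diff_count[0], diff_count[1], diff_count[2]);
    return 0;
}

On an implementation whose low-order bit simply alternates, the "diff 0" bucket stays at (or very near) zero; on a healthy implementation it ends up around half of SAMPLES.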

I would like to say that the implementations for which rand()'s low-order bits aren't random are simply buggy. I'd like to think that all those buggy implementations would have disappeared by now. I'd like to think that programmers shouldn't have to worry about calling rand()%N any more. But, unfortunately, my wishes don't change the fact that this seems to be one of those bugs that never get fixed, meaning that programmers do still have to worry.

See also the C FAQ list, question 13.16.



Source: https://stackoverflow.com/questions/49880304/random-integers-in-c-how-bad-is-randn-compared-to-integer-arithmetic-what-a
