How can I efficiently calculate the binomial cumulative distribution function?

你的背包 2020-12-07 22:10

Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps

10 answers
  •  夕颜 (original poster)
    2020-12-07 22:27

    I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?

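    import numpy as np

    def binomial_cdf(x, n, p):
        """Return P(X <= x) for X ~ Binomial(n, p); assumes 0 < p < 1 and 0 <= x <= n."""
        cdf = 0.0
        b = 0.0  # running log of the binomial coefficient, log C(n, k)
        for k in range(x + 1):
            if k > 0:
                # C(n, k) = C(n, k-1) * (n-k+1) / k, applied in log space
                b += np.log(n - k + 1) - np.log(k)
            log_pmf_k = b + k * np.log(p) + (n - k) * np.log(1 - p)
            cdf += np.exp(log_pmf_k)
        return cdf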
    

    Run time scales linearly with x, since the loop runs x + 1 times. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf, with similar performance at around x=10,000.

    I won't go into a full derivation of this algorithm because Stack Overflow doesn't support MathJax, but the thrust of it is to first identify the following equivalence:

    • For all k > 0, scipy.special.comb(n,k) == np.prod([(n-j+1)/j for j in range(1,k+1)])

    Which we can rewrite as:

    • scipy.special.comb(n,k) == scipy.special.comb(n,k-1) * (n-k+1)/k

    or in log space:

    • np.log( scipy.special.comb(n,k) ) == np.log(scipy.special.comb(n,k-1)) + np.log(n-k+1) - np.log(k)
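
    The three identities above can be spot-checked with nothing but the standard library. The sketch below is only an illustration: the helper names comb_by_product and log_comb_by_recurrence are made up for this example, and math.comb (Python 3.8+) is assumed to be available as the reference value.

```python
import math

def comb_by_product(n, k):
    """Binomial coefficient as the product of (n-j+1)/j for j = 1..k."""
    result = 1.0
    for j in range(1, k + 1):
        result *= (n - j + 1) / j
    return result

def log_comb_by_recurrence(n, k):
    """log C(n, k), built up from log C(n, 0) = 0 via the log-space recurrence."""
    log_c = 0.0
    for j in range(1, k + 1):
        log_c += math.log(n - j + 1) - math.log(j)
    return log_c

n = 30
for k in range(n + 1):
    exact = math.comb(n, k)  # reference value (Python 3.8+)
    # Product form and log-space form agree with the reference up to rounding.
    assert math.isclose(comb_by_product(n, k), exact, rel_tol=1e-9)
    assert math.isclose(math.exp(log_comb_by_recurrence(n, k)), exact, rel_tol=1e-9)
    if k > 0:
        # C(n, k) * k == C(n, k-1) * (n-k+1) holds exactly in integer arithmetic.
        assert exact * k == math.comb(n, k - 1) * (n - k + 1)
```

    The floating-point versions only match up to rounding error, which is why the comparison uses math.isclose rather than ==.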

    Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!
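
    To make the single-loop accumulation concrete, here is a minimal standard-library sketch of the same idea, cross-checked against a direct sum of PMFs. The names binomial_cdf_stdlib and binomial_cdf_reference are hypothetical, the inputs are assumed to satisfy 0 < p < 1 and 0 <= x <= n, and math.comb (Python 3.8+) is used only for the reference.

```python
import math

def binomial_cdf_stdlib(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p); assumes 0 < p < 1 and 0 <= x <= n."""
    log_p, log_q = math.log(p), math.log(1 - p)
    log_coeff = 0.0  # accumulator for log C(n, k), starting at log C(n, 0) = 0
    cdf = 0.0        # accumulator for the running sum of PMFs
    for k in range(x + 1):
        if k > 0:
            # Update log C(n, k-1) to log C(n, k) without any factorials.
            log_coeff += math.log(n - k + 1) - math.log(k)
        cdf += math.exp(log_coeff + k * log_p + (n - k) * log_q)
    return cdf

def binomial_cdf_reference(x, n, p):
    """Direct sum of PMFs using math.comb, for cross-checking only."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

value = binomial_cdf_stdlib(3, 10, 0.5)
assert math.isclose(value, binomial_cdf_reference(3, 10, 0.5), rel_tol=1e-12)
```

    For x=3, n=10, p=0.5 the exact answer is (1 + 10 + 45 + 120) / 1024 = 0.171875, and the loop reproduces it up to floating-point rounding.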

    The reason most of the calculations are done in log space is to improve numerical stability: the factors being multiplied, i.e. the binomial coefficient, p^x and (1-p)^(n-x), have the potential to be extremely large or extremely small, which can cause overflow or underflow.
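
    A small experiment shows the kind of failure that log space avoids. This is a sketch under stated assumptions: log_binomial_pmf is a hypothetical helper, the parameters n=5000, k=2500, p=0.5 are chosen only to force the problem, and math.comb requires Python 3.8+.

```python
import math

def log_binomial_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p); assumes 0 < p < 1 and 0 <= k <= n."""
    log_coeff = sum(math.log(n - j + 1) - math.log(j) for j in range(1, k + 1))
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

n, k, p = 5000, 2500, 0.5

# Naive evaluation: the coefficient is an integer with about 1500 digits,
# far too large to convert to a float, while p**k underflows to 0.0.
try:
    naive = math.comb(n, k) * p**k * (1 - p)**(n - k)
except OverflowError:
    naive = None  # the int-to-float conversion failed

# Log-space evaluation: the huge and tiny terms cancel before exponentiating.
stable = math.exp(log_binomial_pmf(k, n, p))
```

    With these inputs the naive product raises OverflowError, while the log-space version returns a finite probability of roughly 0.0113.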

    EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.
