Calculating Gini coefficient in Python/numpy

谎友^ 2020-12-09 19:51

I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy), but I get an odd result. For a uniform distribution sampled from np.r

4 answers
  • 2020-12-09 19:58

    A quick note on the original methodology:

    When calculating Gini coefficients directly from areas under curves with np.trapz or another integration method, the first value of the Lorenz curve needs to be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this:

    yvals = [0]
    for b in bins[1:]:
    

    I also discussed this issue in this answer, where including the origin in those calculations provides an equivalent answer to using the other methods discussed here (which do not need 0 to be appended).

    In short, when calculating Gini coefficients directly using integration, start from the origin. If using the other methods discussed here, then it's not needed.
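
    As a rough illustration of starting from the origin, here is a minimal sketch that builds a per-observation Lorenz curve and integrates it with the trapezoidal rule. It is not the binned G(v) from the question; the name gini_from_lorenz is made up for this example, and the trapezoid sum is written out by hand rather than calling a NumPy integration function.

```python
import numpy as np

def gini_from_lorenz(values):
    """Gini coefficient via trapezoidal integration of the Lorenz curve.

    Hypothetical helper for illustration; assumes non-negative values
    with a positive sum.
    """
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    # Cumulative population and value shares. Both start at 0 so the
    # area between the origin and the first data point is included.
    pop_share = np.arange(n + 1) / n
    val_share = np.concatenate(([0.0], np.cumsum(v))) / v.sum()
    # Trapezoidal rule: average of adjacent heights times the step width.
    area = np.sum((val_share[1:] + val_share[:-1]) * np.diff(pop_share)) / 2.0
    # The line of equality encloses an area of 1/2, so G = 1 - 2 * area.
    return 1.0 - 2.0 * area
```

    Dropping the leading 0 from val_share (and from pop_share) leaves out the first trapezoid, which is the error described above.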

  • 2020-12-09 20:05

    This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.

    You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than the sample v = np.random.rand(500). In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
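
    For reference, the 1/(6*base + 3) figure can be sketched as follows, taking the Gini coefficient as half the relative mean absolute difference and letting X and Y be independent draws from the uniform distribution on [base, base + 1]. This gives the population value; the Gini coefficient of a finite sample should only be approximately equal to it.

```latex
\begin{align}
\mathbb{E}\,|X - Y| &= \int_0^1\!\!\int_0^1 |x - y|\,dx\,dy = \frac{1}{3}
  \quad\text{(shifting by base does not change differences)}, \\
\mathbb{E}[X] &= \text{base} + \tfrac{1}{2}, \\
G &= \frac{\mathbb{E}\,|X - Y|}{2\,\mathbb{E}[X]}
   = \frac{1/3}{2\,\text{base} + 1}
   = \frac{1}{6\,\text{base} + 3}.
\end{align}
```

    Setting base = 0 recovers the value 1/3 quoted above.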

    Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.

    import numpy as np

    def gini(x):
        # (Warning: This is a concise implementation, but it is O(n**2)
        # in time and memory, where n = len(x).  *Don't* pass in huge
        # samples!)

        # Mean absolute difference
        mad = np.abs(np.subtract.outer(x, x)).mean()
        # Relative mean absolute difference
        rmad = mad/np.mean(x)
        # Gini coefficient
        g = 0.5 * rmad
        return g
    

    Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):

    In [80]: v = np.random.rand(500)
    
    In [81]: gini(v)
    Out[81]: 0.32760618249832563
    
    In [82]: v = 1 + np.random.rand(500)
    
    In [83]: gini(v)
    Out[83]: 0.11121487509454202
    
    In [84]: v = 10 + np.random.rand(500)
    
    In [85]: gini(v)
    Out[85]: 0.01567937753659053
    
    In [86]: v = 100 + np.random.rand(500)
    
    In [87]: gini(v)
    Out[87]: 0.0016594595244509495
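
    If the O(n**2) memory use is a concern, the same quantity can be computed from the sorted sample in O(n log n). The sketch below is one way to do it, not part of the answer above; gini_sorted is a name chosen for this example. It relies on the identity that, for sorted values, the sum of all pairwise absolute differences equals 2 * sum((2*i - n - 1) * x[i]) with i running from 1 to n.

```python
import numpy as np

def gini_sorted(x):
    """Gini coefficient from sorted values (hypothetical helper).

    Assumes non-negative values with a positive sum.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    # sum_i (2i - n - 1) * x_(i) is half the sum of |x_i - x_j| over all
    # ordered pairs; dividing by n * sum(x) gives half the relative
    # mean absolute difference.
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

# Compare seeded samples against the 1/(6*base + 3) population value.
rng = np.random.default_rng(0)
for base in (0, 1, 10, 100):
    sample = base + rng.random(500)
    print(base, gini_sorted(sample), 1 / (6 * base + 3))
```

    The two printed columns should be close but not identical, since each sample is finite.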
    
  • 2020-12-09 20:06

    A slightly faster implementation (using numpy vectorization and only computing each difference once):

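    import numpy as np

    def gini_coefficient(x):
        """Compute Gini coefficient of array of values"""
        diffsum = 0
        # For each element, add its absolute differences to the elements
        # after it, so each unordered pair is counted exactly once.
        for i, xi in enumerate(x[:-1], 1):
            diffsum += np.sum(np.abs(xi - x[i:]))
        return diffsum / (len(x)**2 * np.mean(x))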
    

    Note: x must be a numpy array.

  • 2020-12-09 20:18

    The Gini coefficient is the ratio of the area between the line of perfect equality and the Lorenz curve to the total area under that line. It is usually calculated to analyze the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple Python implementation.
