Weighted percentile using numpy

一个人的身影 2020-12-01 03:28

Is there a way to use the numpy.percentile function to compute a weighted percentile? Or is anyone aware of an alternative Python function that computes weighted percentiles?

12 answers
  • 2020-12-01 04:01

    Apologies for the additional (unoriginal) answer (not enough rep to comment on @nayyarv's). His solution worked for me (i.e. it replicates the default behavior of np.percentile), but I think you can eliminate the for loop with clues from how the original np.percentile is written.

    import numpy as np
    
    def weighted_percentile(a, q=np.array([75, 25]), w=None):
        """
        Calculates percentiles associated with a (possibly weighted) array
    
        Parameters
        ----------
        a : array-like
            The input array from which to calculate percentiles
        q : array-like
            The percentiles to calculate (0.0 - 100.0)
        w : array-like, optional
            The weights to assign to values of a.  Equal weighting if None
            is specified
    
        Returns
        -------
        values : np.array
            The values associated with the specified percentiles.  
        """
        # Standardize and sort based on values in a
        q = np.array(q) / 100.0
        if w is None:
            w = np.ones(a.size)
        idx = np.argsort(a)
        a_sort = a[idx]
        w_sort = w[idx]
    
        # Get the cumulative sum of weights
        ecdf = np.cumsum(w_sort)
    
        # Find the percentile index positions associated with the percentiles
        p = q * (w.sum() - 1)
    
        # Find the bounding indices (both low and high)
        idx_low = np.searchsorted(ecdf, p, side='right')
        idx_high = np.searchsorted(ecdf, p + 1, side='right')
        idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1
    
        # Calculate the weights 
        weights_high = p - np.floor(p)
        weights_low = 1.0 - weights_high
    
        # Extract the low/high indexes and multiply by the corresponding weights
        x1 = np.take(a_sort, idx_low) * weights_low
        x2 = np.take(a_sort, idx_high) * weights_high
    
        # Return the average
        return np.add(x1, x2)
    
    # Sample data
    # (np.float and np.int were removed in NumPy 1.24; use the builtins instead)
    a = np.array([1.0, 2.0, 9.0, 3.2, 4.0], dtype=float)
    w = np.array([2.0, 1.0, 3.0, 4.0, 1.0], dtype=float)
    
    # Make an unweighted "copy" of a for testing
    a2 = np.repeat(a, w.astype(int))
    
    # Tests with different percentiles chosen
    q1 = np.linspace(0.0, 100.0, 11)
    q2 = np.linspace(5.0, 95.0, 10)
    q3 = np.linspace(4.0, 94.0, 10)
    for q in (q1, q2, q3):
        # Compare with a tolerance: exact float equality can fail on rounding
        assert np.allclose(weighted_percentile(a, q, w), np.percentile(a2, q))
    
  • 2020-12-01 04:03
    import numpy as np
    
    def weighted_percentile(a, percentile=np.array([75, 25]), weights=None):
        """
        O(n log n) implementation of a weighted percentile.
        """
        percentile = np.array(percentile)/100.0
        if weights is None:
            weights = np.ones(len(a))
        a_indsort = np.argsort(a)
        a_sort = a[a_indsort]
        weights_sort = weights[a_indsort]
        ecdf = np.cumsum(weights_sort)
    
        percentile_index_positions = percentile * (weights.sum()-1)+1
        # need the 1 offset at the end due to ecdf not starting at 0
        locations = np.searchsorted(ecdf, percentile_index_positions)
    
        out_percentiles = np.zeros(len(percentile_index_positions))
    
        for i, empiricalLocation in enumerate(locations):
            # iterate across the requested percentiles 
            if ecdf[empiricalLocation-1] == np.floor(percentile_index_positions[i]):
                # i.e. is the percentile in between 2 separate values
                uppWeight = percentile_index_positions[i] - ecdf[empiricalLocation-1]
                lowWeight = 1 - uppWeight
    
                out_percentiles[i] = a_sort[empiricalLocation-1] * lowWeight + \
                                     a_sort[empiricalLocation] * uppWeight
            else:
                # i.e. the percentile is entirely in one bin
                out_percentiles[i] = a_sort[empiricalLocation]
    
        return out_percentiles
    

    This is my function; it gives identical behaviour to

    np.percentile(np.repeat(a, weights), percentile)
    

    but with less memory overhead, because the repeated array is never built. np.percentile is an O(n) implementation, so it is potentially faster when the weights are small. My function handles all the edge cases, so it is an exact solution. The interpolation-based answers assume the ECDF is linear between points, when it is actually a step function in most cases (the exception being weights of 1).

    Say we have data [1, 2, 3] with weights [3, 7, 11] and I want the 25th percentile. My ecdf is going to be [3, 10, 21] and I'm looking for the 5th value. The interpolation will see [3, 1] and [10, 2] as the matches and interpolate between them, giving about 1.29, even though the 5th value lies entirely in the 2nd bin, whose value is 2.
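
    The step-versus-interpolation point can be checked numerically. The sketch below is only an illustration, not part of the answer's function: it assumes integer weights [3, 7, 11] (the weights whose cumulative sum is the [3, 10, 21] quoted above) and compares np.percentile on the expanded array with a naive linear interpolation over the ecdf.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
weights = np.array([3, 7, 11])      # assumed integer weights; cumsum is [3, 10, 21]
ecdf = np.cumsum(weights)

# Reference: expand the data according to the weights and use np.percentile
exact = np.percentile(np.repeat(data, weights), 25)

# Naive approach: treat the ecdf as piecewise linear and interpolate on it
position = 0.25 * (weights.sum() - 1)           # 5.0
interpolated = np.interp(position, ecdf, data)

print(exact)         # 2.0, the position falls inside the second bin
print(interpolated)  # roughly 1.2857, pulled toward the first bin
```

    The expanded-array result stays at 2 because every element between positions 3 and 9 is a 2, whereas the linear interpolation blends the values of the first and second bins.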

  • 2020-12-01 04:04

    Completely vectorized numpy solution

    Here is the code I use. It is not optimal (I was unable to write an optimal version with numpy), but it is still much faster and more reliable than the accepted solution.

    import numpy as np
    
    def weighted_quantile(values, quantiles, sample_weight=None,
                          values_sorted=False, old_style=False):
        """ Very close to numpy.percentile, but supports weights.
        NOTE: quantiles should be in [0, 1]!
        :param values: numpy.array with data
        :param quantiles: array-like with many quantiles needed
        :param sample_weight: array-like of the same length as `values`
        :param values_sorted: bool, if True, then will avoid sorting of
            initial array
        :param old_style: if True, will correct output to be consistent
            with numpy.percentile.
        :return: numpy.array with computed quantiles.
        """
        values = np.array(values)
        quantiles = np.array(quantiles)
        if sample_weight is None:
            sample_weight = np.ones(len(values))
        sample_weight = np.array(sample_weight)
        assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
            'quantiles should be in [0, 1]'
    
        if not values_sorted:
            sorter = np.argsort(values)
            values = values[sorter]
            sample_weight = sample_weight[sorter]
    
        weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
        if old_style:
            # To be consistent with numpy.percentile
            weighted_quantiles -= weighted_quantiles[0]
            weighted_quantiles /= weighted_quantiles[-1]
        else:
            weighted_quantiles /= np.sum(sample_weight)
        return np.interp(quantiles, weighted_quantiles, values)
    

    Examples:

    weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.])

    array([ 1. , 3.2, 9. ])

    weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.], sample_weight=[2, 1, 2, 4, 1])

    array([ 1. , 3.2, 9. ])
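
    To see what the old_style branch is doing, the sketch below repeats only its rescaling step for the unweighted case and compares the result with np.percentile. It is an illustration under the assumption of unit weights and NumPy's default (linear) percentile method; the variable names are made up for the example.

```python
import numpy as np

values = np.sort(np.array([1, 2, 9, 3.2, 4]))
unit_weights = np.ones(len(values))   # assumption: every sample has weight 1

# Same steps as the old_style branch: midpoints, shifted to start at 0, scaled to end at 1
positions = np.cumsum(unit_weights) - 0.5 * unit_weights
positions -= positions[0]
positions /= positions[-1]            # now [0, 0.25, 0.5, 0.75, 1]

q = np.array([0.1, 0.25, 0.5, 0.9])
rescaled = np.interp(q, positions, values)
reference = np.percentile(values, 100 * q)

print(np.allclose(rescaled, reference))
```

    With unit weights the rescaled positions are evenly spaced from 0 to 1, which is the same grid np.percentile interpolates on by default. Without the rescaling, the positions are the midpoints 0.1, 0.3, 0.5, 0.7, 0.9, so results away from the median generally differ from np.percentile.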

  • 2020-12-01 04:09

    A cleaner and simpler version, using this reference for the weighted percentile method.

    import numpy as np
    
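    def weighted_percentile(data, weights, perc):
        """
        perc : quantile level(s) in [0, 1], not [0, 100]!
        """
        data = np.asarray(data)  # accept lists as well as arrays
        weights = np.asarray(weights)
        ix = np.argsort(data)
        data = data[ix]  # sort data
        weights = weights[ix]  # sort weights in the same order
        # Midpoint of each weight's cumulative interval, normalised to [0, 1] ('like' a CDF)
        cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)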
        return np.interp(perc, cdf, data)
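    

    A short usage sketch may help show what this midpoint convention returns. The function below is a self-contained restatement under a hypothetical name (midpoint_weighted_quantile), not the answer's exact code, and the inputs are chosen so the arithmetic can be followed by hand.

```python
import numpy as np

def midpoint_weighted_quantile(data, weights, perc):
    """Hypothetical restatement of the midpoint-CDF approach; perc is in [0, 1]."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(data)
    data, weights = data[order], weights[order]
    # Each value sits at the midpoint of its share of the total weight
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    return np.interp(perc, cdf, data)

# Equal weights, odd length: the 0.5 quantile is the ordinary median (3.2)
unweighted_median = midpoint_weighted_quantile([1, 2, 9, 3.2, 4], np.ones(5), 0.5)

# Weights [3, 2, 1]: cdf is [0.25, 0.667, 0.917], so 0.5 interpolates to 1.6
weighted_median = midpoint_weighted_quantile([1, 2, 3], [3, 2, 1], 0.5)

print(unweighted_median, weighted_median)
```

    For the same data and weights, the weightedcalcs answer on this page reports 1.5, which suggests the two use different quantile conventions; it is worth deciding which convention is needed before comparing results across libraries.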
    
  • 2020-12-01 04:12

    The weightedcalcs package supports quantiles:

    import weightedcalcs as wc
    import pandas as pd
    
    df = pd.DataFrame({'v': [1, 2, 3], 'w': [3, 2, 1]})
    calc = wc.Calculator('w')  # w designates weight
    
    calc.quantile(df, 'v', 0.5)
    # 1.5
    
  • 2020-12-01 04:13

    This now seems to be implemented in statsmodels:

    import numpy as np
    from statsmodels.stats.weightstats import DescrStatsW
    wq = DescrStatsW(data=np.array([1, 2, 9, 3.2, 4]), weights=np.array([0.0, 0.5, 1.0, 0.3, 0.5]))
    wq.quantile(probs=np.array([0.1, 0.9]), return_pandas=False)
    # array([2., 9.])
    

    The DescrStatsW object also implements other weighted statistics, such as the weighted mean. See https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html
