Performing grouped average and standard deviation with NumPy arrays

Anonymous (unverified), submitted 2019-12-03 08:48:34

Question:

I have a set of data (X, Y). The values of my independent variable X are not unique; many of them are repeated. I want to output new arrays containing: X_unique, the unique values of X; Y_mean, the mean of all Y values corresponding to each value in X_unique; and Y_std, the standard deviation of all Y values corresponding to each value in X_unique.

x = data[:,0]
y = data[:,1]

Answer 1:

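import numpy as np

# Unique x values, then the mean and standard deviation of the y values sharing each one
x_unique = np.unique(x)
y_means = np.array([np.mean(y[x == u]) for u in x_unique])
y_stds = np.array([np.std(y[x == u]) for u in x_unique])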
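The list comprehensions above scan the whole array once per unique value. For larger inputs, a loop-free variant can be sketched with np.unique(..., return_inverse=True) and np.bincount. This is a different technique from the one in the answer, not a restatement of it; the function name grouped_mean_std is hypothetical, and the sketch assumes 1D numeric inputs and the population standard deviation (ddof=0), which is np.std's default.

```python
import numpy as np

def grouped_mean_std(x, y):
    """Return unique x values with the mean and population std of y per group."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    # inverse[i] is the position in x_unique of the group that x[i] belongs to
    x_unique, inverse = np.unique(x, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    # Per-group sums of y, divided by group sizes
    means = np.bincount(inverse, weights=y) / counts
    # Per-group sums of squared deviations from each element's own group mean
    sq_dev = np.bincount(inverse, weights=(y - means[inverse]) ** 2)
    stds = np.sqrt(sq_dev / counts)
    return x_unique, means, stds

# Illustrative data: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])
x_unique, y_means, y_stds = grouped_mean_std(data[:, 0], data[:, 1])
```

Subtracting the group mean before squaring costs one extra pass but avoids the cancellation error that the "mean of squares minus square of mean" shortcut can suffer from.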


Answer 2:

You can use binned_statistic from scipy.stats, which applies a chosen statistic function to chunks (bins) of a 1D array. To define the chunks, sort the data by the first column and find the positions where its value changes; np.unique with return_index=True gives those positions. Putting it all together, here's an implementation:

from scipy.stats import binned_statistic as bstat

# Sort data corresponding to argsort of first column
sdata = data[data[:,0].argsort()]

# Unique col-1 elements and positions of breaks (elements are not identical)
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])

# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)

According to the docs of binned_statistic, a custom statistic function can also be used:

function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
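As a hedged illustration of the custom-function option quoted above, the sketch below passes a user-defined callable as the statistic argument. The helper value_range (maximum minus minimum of each group) is a hypothetical example, not something from the answer; the sorting and break-point logic mirrors the implementation above, and the small data array is illustrative.

```python
import numpy as np
from scipy.stats import binned_statistic

# Illustrative data: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Sort rows by x, then locate the first row index of each unique x value
sdata = data[data[:, 0].argsort()]
unq_x, breaks = np.unique(sdata[:, 0], return_index=True)
breaks = np.append(breaks, sdata.shape[0])

def value_range(values):
    """Hypothetical custom statistic: spread (max - min) of one bin's values."""
    values = np.asarray(values)
    return values.max() - values.min()

# Bin the row indices so that each bin covers exactly one group of equal x
idx_range = np.arange(sdata.shape[0])
range_y, _, _ = binned_statistic(idx_range, sdata[:, 1],
                                 statistic=value_range, bins=breaks)
```

Because the bin edges are taken from np.unique, no bin is empty here, so the function([]) fallback described in the docs does not come into play.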

Sample input and output:

In [121]: data
Out[121]: 
array([[2, 5],
       [2, 2],
       [1, 5],
       [3, 8],
       [0, 8],
       [6, 7],
       [8, 1],
       [2, 5],
       [6, 8],
       [1, 8]])

In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]: 
array([[ 0.        ,  8.        ,  0.        ],
       [ 1.        ,  6.5       ,  1.5       ],
       [ 2.        ,  4.        ,  1.41421356],
       [ 3.        ,  8.        ,  0.        ],
       [ 6.        ,  7.5       ,  0.5       ],
       [ 8.        ,  1.        ,  0.        ]])


Answer 3:

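Pandas is made for this kind of task:

import numpy as np
import pandas

# Random sample data: column 0 holds the group keys, column 1 the values
data = np.random.randint(1, 5, 20).reshape(10, 2)
pandas.DataFrame(data).groupby(0).mean()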

gives

          1
0          
1  2.666667
2  3.000000
3  2.000000
4  1.500000
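The snippet above returns only the mean, while the question also asks for the standard deviation. A minimal sketch that computes both follows, assuming the same two-column layout; the column names x, y, y_mean and y_std are illustrative choices. Note that pandas' std defaults to the sample standard deviation (ddof=1), whereas np.std defaults to the population value (ddof=0), so ddof=0 is passed explicitly here to match the NumPy answers.

```python
import numpy as np
import pandas as pd

# Illustrative data: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

df = pd.DataFrame(data, columns=["x", "y"])
grouped = df.groupby("x")["y"]

# One row per unique x, with the group mean and population std of y
result = pd.DataFrame({
    "y_mean": grouped.mean(),
    "y_std": grouped.std(ddof=0),  # ddof=0 matches np.std's default
}).reset_index()
```

With the default ddof=1, groups containing a single row would get NaN for the standard deviation instead of 0.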

