Simple linear regression using pandas dataframe

僤鯓⒐⒋嵵緔 提交于 2020-01-03 11:39:07

问题


I'm looking to check trends for a number of entities (SysNr)

I have data spanning 3 years (2014,2015,2016)

I'm looking at a large quantity of variables, but will limit this question to one ('res_f_r')

My DataFrame looks something like this

d = [
    {'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
    {'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
    {'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 2, 'res_f_r': 300000},
    {'RegnskabsAar': 2016, 'SysNr': 2, 'res_f_r': 250000},
]

df = pd.DataFrame(d)



   RegnskabsAar  SysNr  res_f_r
0          2014      1   350000
1          2015      1   400000
2          2016      1   450000
3          2014      2   350000
4          2015      2   300000
5          2016      2   250000

My desire is to do a linear regression on each entity (SysNr) and get returned the slope and intercept

My desired output for the above is

   SysNr  intercept  slope
0      1     300000  50000
1      2     400000 -50000

Any ideas?


回答1:


So I don't know why our intercept values differ (maybe I have made a mistake or your given data is not the full data you expect to work on), but I'd suggest you to use np.polyfit or the tool of your choice (scikit-learn, scipy.stats.linregress, ...) in combination with groupby and apply:

In [25]: df.groupby("SysNr").apply(lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1))
Out[25]:
SysNr
1    [49999.99999999048, -100349999.99998075]
2    [-49999.99999999045, 101049999.99998072]
dtype: object

After that, beautify it:

In [43]: df.groupby("SysNr").apply(
    ...:     lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1)).apply(
    ...:     pd.Series).rename(columns={0:'slope', 1:'intercept'}).reset_index()
Out[43]:
   SysNr    slope     intercept
0      1  50000.0 -1.003500e+08
1      2 -50000.0  1.010500e+08

Edit:

Because you asked on the other answer in the comment about how to handle missing years for some SysNr: Just drop that NaNs for a valid linear regression. Of course you could also fill them with the mean or so, depending on what you want to achieve, but that isn't that helpful from my point of view.

If the entity has only data for one year, you can't apply a linear regression on that usefully. But you can (if you want and that fits your case, please provide more information on the data if needed) extrapolate somehow the slope of the other entities to this one and calculate the intercept. For that of course you must make some assumptions on the distribution of the slope of the entities (e.g. linear, then the slope of sysNr 3 would be -150000.0).




回答2:


You can also use linregress from scipy.stats with groupby from the pandas:

from scipy.stats import linregress

# groupby column
grouped = df.groupby('SysNr')

# https://stackoverflow.com/a/14775604/5916727
# apply linear regression to each group
result_df = pd.DataFrame(grouped.apply(lambda x: linregress(x['RegnskabsAar'], x['res_f_r']))).reset_index()

# https://stackoverflow.com/a/29550458/5916727
# expand result to each column
result_df[['slope', 'intercept', 'r_value', 'p_value', 'std_err']] = result_df[0].apply(pd.Series)

# drop initial column with all in one
del result_df[0]

result_df

Result:

   SysNr    slope    intercept  r_value       p_value  std_err
0      1  50000.0 -100350000.0      1.0  9.003163e-11      0.0
1      2 -50000.0  101050000.0     -1.0  9.003163e-11      0.0


来源:https://stackoverflow.com/questions/48767977/simple-linear-regression-using-pandas-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!