Can scipy.stats identify and mask obvious outliers?

终归单人心 · 2020-12-07 19:44

With scipy.stats.linregress I am performing a simple linear regression on some sets of highly correlated x,y experimental data, and initially visually inspecting each x,y scatter plot for outliers. Is there a way to identify and mask obvious outliers programmatically instead?
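
For illustration, here is a minimal sketch of the setup and the kind of hand-rolled masking I have in mind (the data values are made up):

    import numpy as np
    from scipy import stats

    # Made-up data: highly correlated except for one obvious outlier.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.1, 2.0, 2.9, 12.0, 5.1, 6.0])   # the 12.0 is the outlier

    fit = stats.linregress(x, y)
    resid = y - (fit.slope * x + fit.intercept)

    # Crude mask: drop points whose residual exceeds 2 standard deviations,
    # then refit on the remaining points.
    mask = np.abs(resid) < 2.0 * resid.std()
    refit = stats.linregress(x[mask], y[mask])
    print(refit.slope, refit.intercept)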

4 Answers
  •  隐瞒了意图╮
    2020-12-07 19:46

    It is also possible to limit the effect of outliers using scipy.optimize.least_squares. In particular, take a look at the f_scale parameter:

    Value of soft margin between inlier and outlier residuals, default is 1.0. ... This parameter has no effect with loss='linear', but for other loss values it is of crucial importance.

    The documentation compares three different fits: plain least_squares, and two robust variants that use f_scale:

    from scipy.optimize import least_squares

    # fun is the residual function and x0 the initial parameter guess;
    # t_train, y_train are the training data (from the documentation example).
    res_lsq =     least_squares(fun, x0, args=(t_train, y_train))
    res_soft_l1 = least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train))
    res_log =     least_squares(fun, x0, loss='cauchy', f_scale=0.1, args=(t_train, y_train))


    As can be seen there, plain least squares is affected far more by outliers in the data, and it can be worth experimenting with different loss functions in combination with different values of f_scale (a runnable comparison follows the list below). The possible loss functions are (taken from the documentation):

    ‘linear’ : Gives a standard least-squares problem.
    ‘soft_l1’: The smooth approximation of l1 (absolute value) loss. Usually a good choice for robust least squares.
    ‘huber’  : Works similarly to ‘soft_l1’.
    ‘cauchy’ : Severely weakens outliers influence, but may cause difficulties in optimization process.
    ‘arctan’ : Limits a maximum loss on a single residual, has properties similar to ‘cauchy’.
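
    To make the comparison concrete, here is a small self-contained sketch (my own example, not from the documentation): it fits a straight line to data containing a few gross outliers, once with the default 'linear' loss and once with 'soft_l1', so the difference in the recovered parameters is directly visible.

    import numpy as np
    from scipy.optimize import least_squares

    # Synthetic straight-line data y = 2x + 1 with three gross outliers.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.2, x.size)
    y[[5, 25, 40]] += 15.0

    def residuals(params, x, y):
        a, b = params
        return a * x + b - y

    x0 = np.array([1.0, 0.0])
    res_lin = least_squares(residuals, x0, args=(x, y))  # default loss='linear'
    res_soft = least_squares(residuals, x0, loss='soft_l1', f_scale=0.5, args=(x, y))

    print("linear :", res_lin.x)    # pulled toward the outliers
    print("soft_l1:", res_soft.x)   # should land close to the true (a, b) = (2.0, 1.0)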
    

    The scipy cookbook has a neat tutorial on robust nonlinear regression.
