Fit a curve for data made up of two distinct regimes

Asked 2020-12-10 04:00 by 余生分开走 · 5 answers · 1454 views

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a transition point.

5 Answers
  • 2020-12-10 04:47

    If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.

    The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):

    As a more useful alternative form we consider an hyperbola for which:
    (i) the dependent variable y is a single-valued function of the independent variable x,
    (ii) the left asymptote has slope theta_1,
    (iii) the right asymptote has slope theta_2,
    (iv) the asymptotes intersect at the point (x_o, beta_o),
    (v) the radius of curvature at x = x_o is proportional to a quantity delta.

    Such an hyperbola can be written

    y = beta_o + beta_1*(x - x_o) + beta_2*SQRT[(x - x_o)^2 + delta^2/4],

    where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.

    delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.

    Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
    BTW, in general, if line 1 is y_1 = b_1 + m_1*x and line 2 is y_2 = b_2 + m_2*x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1*x*. So, to connect with the formalism above, x_o = x*, beta_o = y*, and m_1 and m_2 play the roles of theta_1 and theta_2.
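
    As a concrete sketch (my illustration, not code from the paper), fitting that form with scipy.optimize.curve_fit might look like this, where x and y are assumed to hold your data arrays and the initial guesses are placeholders you'd eyeball from a plot:

    import numpy as np
    from scipy.optimize import curve_fit
    
    def watts_bacon(x, beta0, x0, theta1, theta2, delta):
        # y = beta_o + beta_1*(x - x_o) + beta_2*sqrt((x - x_o)^2 + delta^2/4)
        beta1 = (theta1 + theta2) / 2.0
        beta2 = (theta2 - theta1) / 2.0
        return beta0 + beta1*(x - x0) + beta2*np.sqrt((x - x0)**2 + delta**2 / 4.0)
    
    # p0: guesses for (beta0, x0, theta1, theta2, delta)
    popt, pcov = curve_fit(watts_bacon, x, y, p0=(75., 40., 0., 0.7, 1.))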

  • 2020-12-10 04:50

    I researched this a little; Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture notes by Steiger both have good material on it. They all, however, lack the right model for this case: the piecewise function should be C = th0 + th1*Temp + th2*max(0, Temp - gamma), which the code below fits.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import lmfit
    
    dfseg = pd.read_csv('segreg.csv')
    
    def err(w):
        th0 = w['th0'].value
        th1 = w['th1'].value
        th2 = w['th2'].value
        gamma = w['gamma'].value
        # piecewise model: the slope changes by th2 once Temp exceeds gamma
        fit = th0 + th1*dfseg.Temp + th2*np.maximum(0, dfseg.Temp - gamma)
        return fit - dfseg.C
    
    p = lmfit.Parameters()
    p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('gamma', 40.))
    mi = lmfit.minimize(err, p)
    lmfit.printfuncs.report_fit(mi.params)
    
    b0 = mi.params['th0'].value
    b1 = mi.params['th1'].value
    b2 = mi.params['th2'].value
    gamma = int(mi.params['gamma'].value)
    
    # refit with the breakpoint held fixed to get OLS standard errors
    import statsmodels.formula.api as smf
    reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma, gamma), data=dfseg).fit()
    print(reslin.summary())
    
    # plot the two fitted segments on either side of the breakpoint
    x0 = np.arange(0, gamma)
    x1 = np.arange(0, 80 - gamma)
    y0 = b0 + b1*x0
    y1 = b0 + b1*gamma + (b1 + b2)*x1
    plt.scatter(dfseg.Temp, dfseg.C)
    plt.plot(x0, y0)
    plt.plot(x1 + gamma, y1)
    plt.show()
    

    Result

    [[Variables]]
        th0:     78.6554456 +/- 3.966238 (5.04%) (init= 0)
        th1:    -0.15728297 +/- 0.148250 (94.26%) (init= 0)
        th2:     0.72471237 +/- 0.179052 (24.71%) (init= 0)
        gamma:   38.3110177 +/- 4.845767 (12.65%) (init= 40)
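
    Reading the fit off those numbers: below the breakpoint gamma ≈ 38.3 the slope is th1 ≈ -0.157, and above it the slope is th1 + th2 ≈ -0.157 + 0.725 ≈ 0.567, which recovers the shallow-then-steep shape of the data.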
    

    The data

    "","Temp","C"
    "1",8.5536,86.2143
    "2",10.6613,72.3871
    "3",12.4516,74.0968
    "4",16.9032,68.2258
    "5",20.5161,72.3548
    "6",21.1613,76.4839
    "7",24.3929,83.6429
    "8",26.4839,74.1935
    "9",26.5645,71.2581
    "10",27.9828,78.2069
    "11",32.6833,79.0667
    "12",33.0806,71.0968
    "13",33.7097,76.6452
    "14",34.2903,74.4516
    "15",36,56.9677
    "16",37.4167,79.8333
    "17",43.9516,79.7097
    "18",45.2667,76.9667
    "19",47,76
    "20",47.1129,78.0323
    "21",47.3833,79.8333
    "22",48.0968,73.9032
    "23",49.05,78.1667
    "24",57.5,81.7097
    "25",59.2,80.3
    "26",61.3226,75
    "27",61.9194,87.0323
    "28",62.3833,89.8
    "29",64.3667,96.4
    "30",65.371,88.9677
    "31",68.35,91.3333
    "32",70.7581,91.8387
    "33",71.129,90.9355
    "34",72.2419,93.4516
    "35",72.85,97.8333
    "36",73.9194,92.4839
    "37",74.4167,96.1333
    "38",76.3871,89.8387
    "39",78.0484,89.4516
    

    Graph (figure not reproduced: scatter of the data with the two fitted segments)

  • 2020-12-10 04:54

    I used @user423805 's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.

    Instead of applying np.maximum in the minimizer error function, or adding (b1 + b2) as in @user423805's answer, I used the same linear-spline calculation for both the minimizer and end use:

    # least_splines_calc works like this for an example with three segments    
    # (four threshold params, three gamma params):
    #
    # for      0 < x < gamma0 : y = th0 + (th1 * x) 
    # for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0)) 
    # for gamma1 < x          : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))  
    #
    
    import numpy as np
    import lmfit
    
    def least_splines_calc(x, thresholds, gammas):
    
        if len(thresholds) < 2:
            print("Error: expected at least two thresholds")
            return None
    
        # only breakpoints to the left of x contribute
        applicable_gammas = [gamma for gamma in gammas if x > gamma]
    
        # base result
        y = thresholds[0] + (thresholds[1] * x)
    
        # additional slope terms switch on as x passes each breakpoint
        for i in range(len(applicable_gammas)):
            y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    
        return y
    
    def least_splines_calc_array(x_array, thresholds, gammas):
        return [least_splines_calc(x, thresholds, gammas) for x in x_array]
    
    def err(params, x, data):
    
        th0 = params['th0'].value
        th1 = params['th1'].value
        th2 = params['th2'].value
        th3 = params['th3'].value
        gamma1 = params['gamma1'].value
        gamma2 = params['gamma2'].value
    
        thresholds = np.array([th0, th1, th2, th3])
        gammas = np.array([gamma1, gamma2])
    
    
        fit = least_splines_calc_array(x, thresholds, gammas)
    
        return np.array(fit)-np.array(data)
    
    p = lmfit.Parameters()
    p.add_many(('th0', 0.), ('th1', 0.0),('th2', 0.0),('th3', 0.0),('gamma1', 9.),('gamma2', 9.3)) #NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
    
    mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))  # dfseg as in the previous answer
    

    After minimization, convert the params found by the minimizer into an array of thresholds and an array of gammas, then re-use least_splines_calc_array to plot the linear-spline regression, as sketched below.
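
    A minimal sketch of that step (my illustration; it assumes the mi result above and the dfseg data frame from the previous answer, and rebuilds the arrays the same way err does):

    import matplotlib.pyplot as plt
    
    # rebuild the threshold/gamma arrays from the fitted params
    thresholds = np.array([mi.params[k].value for k in ('th0', 'th1', 'th2', 'th3')])
    gammas = np.array([mi.params[k].value for k in ('gamma1', 'gamma2')])
    
    # evaluate the spline on a dense grid and overlay it on the data
    x_dense = np.linspace(dfseg.Temp.min(), dfseg.Temp.max(), 200)
    plt.scatter(dfseg.Temp, dfseg.C)
    plt.plot(x_dense, least_splines_calc_array(x_dense, thresholds, gammas))
    plt.show()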

    Reference: there are various places that explain least splines. I think @user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations. The one that made the most sense to me was this one by Rob Schapire / Zia Khan at Princeton: https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines.

  • 2020-12-10 04:56

    If you don't have a particular reason to believe that linear + exponential is the true underlying model for your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.optimize import curve_fit
    
    def two_lines(x, a, b, c, d):
        one = a*x + b
        two = c*x + d
        return np.maximum(one, two)
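
    Because the data bends upward, the upper envelope of the two lines is exactly the desired piecewise-linear model, and taking np.maximum keeps the fitted function continuous at the crossover.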
    

    Then,

    x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
    
    pw0 = (.02, 30, .2, -2000)  # a guess for slope, intercept, slope, intercept
    pw, cov = curve_fit(two_lines, x, y, pw0)
    # the two lines a*x + b and c*x + d intersect at x = (d - b) / (a - c)
    crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
    
    plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
    

    If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has exactly this kind of sharp bend, but it has to be rotated into place. It was a bit difficult to implement (maybe there's an easier way), but here's a go:

    def hyperbola(x, a, b, c, d, e):
        """ hyperbola(x) with parameters
            a/b = asymptotic slope
             c  = curvature at vertex
             d  = offset to vertex
             e  = vertical offset
        """
        return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e
    
    def rot_hyperbola(x, a, b, c, d, e, th):
        pars = a, b, c, 0, 0 # do the shifting after rotation
        xd = x - d
        hsin = hyperbola(xd, *pars)*np.sin(th)
        xcos = xd*np.cos(th)
        return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
    

    Run it as

    h0 = 1.1, 1, 0, 5000, 100, .5
    h, hcov = curve_fit(rot_hyperbola, x, y, h0)
    plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
    plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
    plt.show()
    

    (figure: the data with the piecewise-linear and rotated-hyperbola fits)

    I was also able to get the line + exponential to converge, but it looks terrible. That's because it's not a good descriptor of your data: the data is piecewise linear, and an exponential is very far from linear.

    def line_exp(x, a, b, c, d, e):
        return a*x + b + c*np.exp((x-d)/e)
    
    e0 = .1, 20., .01, 1000., 2000.
    e, ecov = curve_fit(line_exp, x, y, e0)
    

    If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)

    from scipy.interpolate import UnivariateSpline
    s = UnivariateSpline(x, y, s=x.size)  #larger s-value has fewer "knots"
    plt.plot(x, s(x))
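
    (Note: UnivariateSpline requires the x values to be in increasing order, so sort the data first if it isn't already.)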
    

    (figure: the same data with the line + exponential and spline fits added)

  • 2020-12-10 05:01

    There is a straightforward method (not iterative, no initial guess required) on pp. 12-13 of https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf

    The data comes from scanning the figure published by IanRoberts in his question. Scanning the pixel coordinates is not accurate, so don't be surprised by some additional deviation.

    Note that the abscissa and ordinate scales have both been divided by 1000.

    The equations of the two segments, and the approximate values of the five parameters, are written on the figure from that fit (figure not reproduced here).
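
    For comparison, a minimal sketch of a simple non-iterative alternative (my own illustration, not the method from the linked document): brute-force the breakpoint over the interior data points and fit each side by ordinary least squares.

    import numpy as np
    
    def two_segment_fit(x, y):
        # sort by x, then try each interior point as the breakpoint
        order = np.argsort(x)
        x, y = np.asarray(x)[order], np.asarray(y)[order]
        best = None
        for i in range(2, len(x) - 1):   # keep at least two points per side
            m1, b1 = np.polyfit(x[:i], y[:i], 1)
            m2, b2 = np.polyfit(x[i:], y[i:], 1)
            sse = (np.sum((m1*x[:i] + b1 - y[:i])**2)
                   + np.sum((m2*x[i:] + b2 - y[i:])**2))
            if best is None or sse < best[0]:
                best = (sse, x[i], (m1, b1), (m2, b2))
        return best   # (sse, breakpoint x, (slope1, intercept1), (slope2, intercept2))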
