Efficiently calculating a segmented regression on a large dataset

Submitted by 我的梦境 on 2019-12-13 16:13:23

Question


I have a large data set for which I need to calculate a segmented regression (or fit a piecewise linear function in some similar way). The difficulty is that both the data set and the number of pieces are very large.

Currently I have the following approach:

  • Let s_i be the end of segment i
  • Let (x_i, y_i) denote the i-th data point

Assume the data point x_k lies within segment j. Then I can create a vector from x_k as

(s_1, s_2 - s_1, s_3 - s_2, ..., x_k - s_{j-1}, 0, 0, ...)

To do the segmented regression, I can then run an ordinary linear regression of the y_k on these vectors. The fitted coefficient for component i is the slope of segment i.
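The construction above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `design_matrix` is hypothetical, the first segment is assumed to start at 0 (as the vector in the question implies), and there is no intercept term, so the fitted function passes through the origin.

```python
import numpy as np

def design_matrix(x, ends, start=0.0):
    """Hypothetical helper: one row per data point, one column per segment.

    Entry (k, i) is the length of the part of segment i that lies to the
    left of x_k, which reproduces (s_1, s_2 - s_1, ..., x_k - s_{j-1}, 0, ...).
    """
    x = np.asarray(x, dtype=float)
    ends = np.asarray(ends, dtype=float)
    starts = np.concatenate(([start], ends[:-1]))  # s_0, s_1, ..., s_{n-1}
    widths = ends - starts
    # Clip the distance from each segment start to [0, segment width].
    return np.clip(x[:, None] - starts[None, :], 0.0, widths[None, :])

# Toy data (assumed, noiseless) so the fit can be checked by eye.
ends = np.array([1.0, 2.5, 4.0])
true_slopes = np.array([2.0, -1.0, 0.5])
x = np.linspace(0.05, 3.95, 40)
A = design_matrix(x, ends)
y = A @ true_slopes

# Ordinary least squares; each coefficient is the slope of one segment.
slopes, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Because every coefficient is a slope and the contributions are summed from left to right, neighbouring segments meet at the boundaries by construction. The matrix is dense here; at 600,000 x 2,000 entries that is the size problem described next.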

However, my current estimates show that if I define the problem that way, I will get about 600,000 vectors with about 2,000 components each. I haven't benchmarked it yet, but I don't think my computer will be able to solve such a large regression problem in any acceptable time.

Is there a better way to calculate this kind of regression problem? One idea was to use some kind of hierarchical approach: first calculate one regression over a combined set of several segments, so that I can determine the start and end points for this set, and then calculate an individual segmented regression within the set. However, I cannot figure out how to calculate the regression within the set so that both endpoints match (by fixing the intercept I can match either the start point or the end point, but not both).

Another idea I had was to calculate an individual regression for each segment and then use only the slope for that segment. However, with that approach errors might start to accumulate, and I have no way to control this kind of error accumulation.

Yet another idea is to do an individual regression for each segment, but fix the intercept to the endpoint of the previous segment. However, I am still not sure whether errors would accumulate this way.
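A minimal sketch of this last idea, assuming every segment contains at least one point to the right of its left boundary and that the value at the first boundary is known. The function name `chained_fit` and the parameter `y_start` are hypothetical. Each segment gets a one-parameter least-squares fit through the endpoint of the previous segment:

```python
import numpy as np

def chained_fit(x, y, knots, y_start):
    """Hypothetical helper: fit segments left to right, each anchored at
    the endpoint of the previous one.

    knots   : boundaries s_0 < s_1 < ... < s_n
    y_start : assumed known value of the fit at s_0
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    knot_y = [y_start]
    slopes = []
    for left, right in zip(knots[:-1], knots[1:]):
        mask = (x >= left) & (x <= right)
        dx = x[mask] - left
        dy = y[mask] - knot_y[-1]
        # Least-squares slope of a line forced through (left, knot_y[-1]).
        slope = np.dot(dx, dy) / np.dot(dx, dx)
        slopes.append(slope)
        knot_y.append(knot_y[-1] + slope * (right - left))
    return np.array(slopes), np.array(knot_y)

# Toy data (assumed, noiseless): a piecewise linear function sampled on a grid.
knots = np.array([0.0, 1.0, 2.5, 4.0])
x = np.linspace(0.0, 4.0, 81)
y = np.interp(x, knots, [1.0, 3.0, 1.5, 2.25])
slopes, knot_y = chained_fit(x, y, knots, y_start=1.0)
```

Each step costs time linear in the number of points in that segment, so the whole pass is cheap. The concern raised above still applies: an error in one fitted endpoint is carried into every later segment, and this sketch does nothing to correct for it.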

Clarification

Not sure if this was clear from the rest of the question: I know where the segments start and end. The most important part is that each line segment has to meet the next segment at the segment boundary.

EDIT

Maybe another fact that could help: all points have different x values.


Answer 1:


I would group the points into rectangular grid areas based on their position. That way you process the task as several smaller data sets and merge the results once all of them are done.

I would process each group like this:

  1. compute a histogram of angles
  2. take only the most frequent angles; their count determines the number of line segments present in the group
  3. do the regression/line fit for these angles (see this Answer, it does something very similar, just for a single line)
  4. compute the intersection points between the line segments to get the endpoints of your piecewise polyline, and also the connectivity info (join the closest endpoints)

[edit1] after OP edit

You know the edge x coordinates of all segments (x0, x1, ...), so just compute the average y coordinate of the points near each segment edge (gray area, green points) and you have the segment line endpoints (blue points). Of course this is not a fit or regression, because it discards all the other points, so it leads to bigger errors (unless the segment x coordinates correspond to the regressed lines ...), but there is no way around it with the constraints your solution has (at least I do not see any).

Because if you use regression on the data of a single segment, you cannot connect it to the other segments, and if you try to merge them, you get almost the same result as this.

The size of the gray area determines the output, so play with it a bit.
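The edge-averaging idea from this edit can be sketched as follows. The function name `edge_averages` and the parameter `half_width` (half the width of the gray area) are assumptions made for illustration. Sorting once and using prefix sums keeps the cost low even with many edges:

```python
import numpy as np

def edge_averages(x, y, edges, half_width):
    """Hypothetical helper: mean y of the points within half_width of each
    edge; NaN where no point falls inside the window."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.asarray(edges, dtype=float)
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # Prefix sums of y so each window sum is a difference of two entries.
    csum = np.concatenate(([0.0], np.cumsum(ys)))
    lo = np.searchsorted(xs, edges - half_width, side="left")
    hi = np.searchsorted(xs, edges + half_width, side="right")
    counts = hi - lo
    sums = csum[hi] - csum[lo]
    out = np.full(len(edges), np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

# Toy data (assumed): a single straight line, so the expected endpoints are known.
x = np.linspace(0.0, 10.0, 1001)
y = 2.0 * x + 1.0
edges = np.array([2.0, 5.0, 8.0])
knot_y = edge_averages(x, y, edges, half_width=0.505)
```

Joining the points (edges[i], knot_y[i]) in order gives the polyline. As noted above, this ignores every point outside the windows, so `half_width` trades noise against bias at the kinks.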



Source: https://stackoverflow.com/questions/29231959/efficiently-calculating-a-segmented-regression-on-a-large-dataset
