Approximate matching of two lists of events (with duration)

Question

I have a black box algorithm that analyses a time series and "detects" certain events in the series. It returns a list of events, each containing a start time and end time. The events do not overlap. I also have a list of the "true" events, again with start time and end time for each event, not overlapping.

I want to compare the two lists and match detected and true events that fall within a certain time tolerance (True Positives). The complication is that the algorithm may detect events that are not really there (False Positives) or might miss events that were there (False Negatives).

What is an algorithm that optimally pairs events from the two lists and leaves the appropriate events unpaired? I am pretty sure I am not the first one to tackle this problem and that such a method exists, but I haven't been able to find it, perhaps because I do not know the right terminology.

Speed requirement: The lists will contain no more than a few hundred entries, and speed is not a major factor. Accuracy is more important. Anything taking less than a few seconds on an ordinary computer will be fine.

Answer 1:

Here's a quadratic-time algorithm that gives a maximum likelihood estimate with respect to the following model. Let A1 < ... < Am be the true intervals and let B1 < ... < Bn be the reported intervals, each list sorted by time (the events do not overlap, so this order is well defined). The quantity sub(i, j) is the log-likelihood that Ai becomes Bj. The quantity del(i) is the log-likelihood that Ai is deleted. The quantity ins(j) is the log-likelihood that Bj is inserted. Make independence assumptions everywhere! I'm going to choose sub, del, and ins so that, for every i < i' and every j < j', we have

sub(i, j') + sub(i', j) <= max {sub(i, j )       + sub(i', j')
                               ,del(i) + ins(j') + sub(i', j )
                               ,sub(i, j')       + del(i') + ins(j)
                               }.

This ensures that the optimal matching between intervals is noncrossing and thus that we can use the following Levenshtein-like dynamic program.
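If you want to sanity-check a particular choice of costs against this condition, a brute-force test over all index pairs is cheap for lists of a few hundred events at most. The following is only a sketch: the function name crossing_violations, the helper costs sq_sub and sq_gap, and the representation of intervals as (start, end) tuples are my own assumptions for illustration. The name dele is used because del is a reserved word in Python.

```python
from itertools import combinations

def crossing_violations(A, B, sub, dele, ins):
    """List every (i, i2, j, j2), 0-based with i < i2 and j < j2, for which
    the crossing pairing (A[i] with B[j2], A[i2] with B[j]) scores strictly
    better than all three alternatives in the inequality."""
    bad = []
    for i, i2 in combinations(range(len(A)), 2):
        for j, j2 in combinations(range(len(B)), 2):
            crossed = sub(A[i], B[j2]) + sub(A[i2], B[j])
            best = max(
                sub(A[i], B[j]) + sub(A[i2], B[j2]),
                dele(A[i]) + ins(B[j2]) + sub(A[i2], B[j]),
                sub(A[i], B[j2]) + dele(A[i2]) + ins(B[j]),
            )
            if crossed > best:
                bad.append((i, i2, j, j2))
    return bad

# Hypothetical example costs: squared endpoint distances and squared durations.
def sq_sub(a, b):
    return -((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

def sq_gap(a):
    return -((a[1] - a[0]) ** 2)

A = [(0, 2), (5, 6), (10, 14)]   # true intervals, sorted
B = [(1, 2), (9, 13)]            # reported intervals, sorted
violations = crossing_violations(A, B, sq_sub, sq_gap, sq_gap)
```

An empty list means the condition holds on that data. The check is O(m^2 n^2), so it is meant for testing a cost model on sample data rather than for routine use.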

The dynamic program is presented as a memoized recursive function, score(i, j), that computes the optimal score of matching A1, ..., Ai with B1, ..., Bj. The root of the call tree is score(m, n). It can be modified to return the sequence of sub(i, j) operations in the optimal solution.

score(i, j) | i == 0 && j == 0 =      0
            | i >  0 && j == 0 =      del(i)    + score(i - 1, 0    )
            | i == 0 && j >  0 =      ins(j)    + score(0    , j - 1)
            | i >  0 && j >  0 = max {sub(i, j) + score(i - 1, j - 1)
                                     ,del(i)    + score(i - 1, j    )
                                     ,ins(j)    + score(i    , j - 1)
                                     }
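For concreteness, here is one way the recurrence above could be written as a bottom-up table with a traceback that recovers the pairing. Treat it as a sketch under my own assumptions: the function name match_events, the 0-based indices, the (start, end) tuple representation, and the return format are all illustrative choices, and the cost functions are passed in as callables taking intervals rather than indices.

```python
def match_events(A, B, sub, dele, ins):
    """Align sorted interval lists A (true) and B (detected).

    Returns (score, pairs, missed, spurious): pairs is a list of (i, j)
    index tuples, missed lists unmatched indices of A, and spurious lists
    unmatched indices of B. All indices are 0-based.
    """
    m, n = len(A), len(B)
    # score[i][j] = best score for aligning A[:i] with B[:j]
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = score[i - 1][0] + dele(A[i - 1])
    for j in range(1, n + 1):
        score[0][j] = score[0][j - 1] + ins(B[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(A[i - 1], B[j - 1]),
                score[i - 1][j] + dele(A[i - 1]),
                score[i][j - 1] + ins(B[j - 1]),
            )
    # Walk back from (m, n), re-deriving which option produced each cell.
    pairs, missed, spurious = [], [], []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + sub(A[i - 1], B[j - 1])):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + dele(A[i - 1]):
            missed.append(i - 1)
            i -= 1
        else:
            spurious.append(j - 1)
            j -= 1
    return score[m][n], pairs[::-1], missed[::-1], spurious[::-1]

# Example with squared-distance costs (an assumption for this demo only).
pair_cost = lambda a, b: -((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)
gap_cost = lambda a: -((a[1] - a[0]) ** 2)
true_events = [(0, 10), (20, 30), (50, 60)]
detected = [(1, 11), (35, 36), (49, 61)]
total, pairs, missed, spurious = match_events(
    true_events, detected, pair_cost, gap_cost, gap_cost)
# pairs == [(0, 0), (2, 2)], missed == [1], spurious == [1]
```

The pairs are the True Positives, missed the False Negatives, and spurious the False Positives. Time and memory are both O(mn), which is negligible for a few hundred events per list.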

Here are some possible definitions for sub, del, and ins. I'm not sure if they will be any good; you may want to multiply their values by constants or use powers other than 2. If Ai = [s, t] and Bj = [u, v], then define

sub(i, j) = -(|u - s|^2 + |v - t|^2)
del(i) = -(t - s)^2
ins(j) = -(v - u)^2.
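As a small illustration, the definitions above translate directly into code. The weight w and power p parameters below are hypothetical knobs standing in for the "constants" and "powers other than 2" mentioned above. The helper can_pair is also my own addition, not part of the algorithm. It only checks a necessary condition: if pairing two intervals scores no better than deleting one and inserting the other, the dynamic program has no reason to pair them.

```python
def sub(a, b, w=1.0, p=2):
    """Score for true interval a = (s, t) being reported as b = (u, v)."""
    (s, t), (u, v) = a, b
    return -w * (abs(u - s) ** p + abs(v - t) ** p)

def dele(a, w=1.0, p=2):
    """Score for true interval a = (s, t) being missed ("del" is reserved)."""
    s, t = a
    return -w * (t - s) ** p

def ins(b, w=1.0, p=2):
    """Score for reported interval b = (u, v) being spurious."""
    u, v = b
    return -w * (v - u) ** p

def can_pair(a, b):
    """True if pairing a with b beats deleting a and inserting b."""
    return sub(a, b) > dele(a) + ins(b)

close = can_pair((0, 10), (1, 11))    # sub = -2 vs. del + ins = -200
far = can_pair((20, 30), (35, 36))    # sub = -261 vs. del + ins = -101
```

Note that with these definitions the penalty for leaving an event unpaired grows with its duration, so short events are cheap to drop and long ones are expensive. Whether that suits your data is something you would need to check. Separate weights for sub, dele, and ins are one way to adjust it.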

(Apologies to the undoubtedly extant academic who published something like this in the bioinformatics literature many decades ago.)

Source: https://stackoverflow.com/questions/22174839/approximate-matching-of-two-lists-of-events-with-duration

Tags

algorithm

pattern-matching