[Solved]: Algorithm to match timestamped events from two sources

Problem Detail: does a good known algorithm exists for this problem? On input I have two series of timestamps “when the event was observed”. Theoretically the recorded timestamps should be very well aligned. Visualized ideal situation on two time lines “s” and “r” as recorded from the two devices: Ideal input situation but more likely they will not be so nicely aligned and there might be missing events from timeline s or r: Visualized timestamps of events in real world conditions I am looking for algorithm to match events from “s” and “r” like this: Matched events So that the result will be something like: (s1,r1); (s2,r2); (s3,null); (s4,r3);(null,r4)… Or something similar. Maybe with some “confidence” rating. It feels like there should be some well known algorithm for this problem, but I am a bit out of my element here. Thanks.

Asked By : urza.cc

Answered By : D.W.

Here is a pragmatic approach:

  1. Look for matching pairs. Consider a pair $(s,r)$ (where $s$ is from the first device and $r$ from the second device) to be a match if their timestamps are very close, i.e., $|s-r| le epsilon$ for some small fixed value of $epsilon$ that you choose in advance. $epsilon$ is a tolerance. To find matching pairs, for each $s$, you look for a value $r$ in the range $[s-epsilon,s+epsilon]$. You can quickly find all matching pairs by first sorting $s$ and $r$ individually, then doing a linear scan across them.
  2. Count how many values don’t have a match. Use the number of values that don’t have a match to derive a confidence score. The more mismatches, the less confident you are.

There are a number of elaborations possible:

  1. For instance, if you’ve found $n$ matching pairs $(s_1,r_1),dots,(s_n,r_n)$, you can use $|s_1-r_1|+dots+|s_n-r_n|$ as an additional confidence measure: the higher this is, the worse a match it is. You could compute a weighted sum of $|s_1-r_1|+dots+|s_n-r_n|$ and the number of mismatched pairs.
  2. If you frequently find that many points can be matched to multiple other possible points (one $s$ has multiple $r_i$’s it could match to), you could try framing this as a bipartite graph matching problem. Basically, you have a vertex for each sample point, and an edge between $s$ and $r$ if $|s-r|le epsilon$; now find a maximum matching in this bipartite graph, and that will give you a way to find the maximum number of matched pairs. Then you can use the number of unmatched values as a confidence score. You can also use some kind of search over a range of $epsilon$ values, if it’s not clear how to choose $epsilon$ a priori. However, this is probably overkill and unnecessary, if events are well-aligned (if the typical distance between two events from a same device is much larger than the typical alignment error in the timestamp for an event at one device vs the timestamp at the other device). If events are well-aligned, simply choosing a suitable threshold $epsilon$ and applying the simple approach at the top of my answer will probably suffice.
  3. If you expect systematic clock skew, where one device’s clock is a bit ahead of the other device’s clock (by a constant), you could correct for this in a number of ways:
    • One simple pragmatic way is to look for a set of matching pairs $(s_1,r_1),dots,(s_n,r_n)$ in a first pass (this first pass doesn’t have to be perfect, so you could use a larger value of $epsilon$). Then, use the median of $s_1-r_1, dots, s_n-r_n$ as your estimate of the clock skew; say the median is $Delta$. Now replace each $s_i$ with $s’_i=s_i-Delta$. Finally, do a second pass where you match the (corrected) $s’_i$’s to the $r_j$’s, maybe with a smaller value of $epsilon$. In practice this might work very well.
    • Alternatively, you could use RANSAC to fit a line $s=r+c$ to the data, where the constant $c$ is unknown. Normally RANSAC is described as a way to fit a line $y=ax+c$ (in this setting, $s=ar+c$), but in your application, we can fix $a=1$ (assuming both clocks run at the same rate and we care only about clock skew) and then apply RANSAC methods to find the best fit line, i.e., to find the constant $c$ that provides the best fit.
Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/26227  Ask a Question  Download Related Notes/Documents

Leave a Reply