Detecting a subsequence that's an arithmetic progression, in a sorted sequence

Problem Detail: I have following problem: I have a sorted sequence of $N$ integers (assume they are monotonically increasing). I want to check whether there is any subsequence of length $ge N/4$, such that consecutive elements of the subsequence all differ by the same value. For example, in the sequence [3,4,5,8,12] there are two such subsequences: [3,4,5] (the difference is 1) and [4,8,12] (the difference is 4). Thus, the length of longest such subsequence is 3 for this example. Since $3 ge 5/4$, the answer is yes, there is a subsequence of length $ge N/4$ with the desired property. In my real-life situation, the sequence is of length $Napprox 10^6$, and the elements are all 9-digit numbers. Is there an efficient algorithm to solve this problem? My naive approach was to create Cartesian product with absolute differences between numbers: $$ left( begin{array}{ccccc} 0 & 1 & 2 & 5 & 9 1 & 0 & 1 & 4 & 8 2 & 1 & 0 & 3 & 7 5 & 4 & 3 & 0 & 4 9 & 8 & 7 & 4 & 0 end{array} right) $$ And then focus on top-right part and compute number of occurrences of each difference, so: $$ ||text{diff-by-1}|| = 2 => text{3 numbers diff by 1} ||text{diff-by-4}|| = 2 => text{3 numbers diff by 4} $$ This is very simple and very ineffective. It requires lot of comparisons and does not scale (at all): its running time is $Theta(N^2)$. In my real life scenario my sequence is ~10^6 long, so this is too slow. To give you wider picture as maybe there is much better (probabilistic) approach to this problem: after largest sub-sequence is found I want to compute simple ratio: $$ r:=frac{text{largest sub-sequence length}}{text{sequence length}} $$ and if $r$ is greater then some fixed value I want to raise alarm (or do whatever I have to do ;-)). Thanks for any help, references, pointers, etc. BTW: here are things that I was/am looking at:

Update: was thinking a little bit more about it and started from the end, so instead of computing all differences between numbers (top-right corner of the matrix) I can derive small $k$ value from “fixed value” I mentioned at the end of original question. For instance if I am going to raise the alarm when 25% of all numbers are in some sequence I need to focus on small “triangles” in matrix and number of computations required is smaller (much smaller). When I add some sampling then it should be simple enough to implement at scale. Update 2 – Implemented @D.W. algorithm, sample run below:

    11:51:06 ~$ time nodejs progression.js      L: 694000000,694000002,694000006,694000007,694000009,694000010,         694000013,694000015,694000018,694000019,694000021,694000022,694000023,     694000026,694000028,694000030,694000034,694000036,694000038,694000040,     694000043,694000045,694000046,694000048,694000051,694000053,694000055,     694000057,694000060,694000061,694000063,694000067,694000069,694000072,     694000074,694000076,694000077,694000079,694000080,694000082,694000083,     694000084,694000086,694000090,694000091,694000093,694000095,694000099,     694000102,694000103,694000105,694000108,694000109,694000113,694000116,     694000118,694000122,694000125,694000128,694000131,694000134,694000137,     694000141,694000143,694000145,694000148,694000152,694000153,694000154,     694000157,694000160,694000162,694000163,694000166,694000170,694000173,     694000174,694000177,694000179,694000180,694000181,694000184,694000185,     694000187,694000189,694000193,694000194,694000198,694000200,694000203,     694000207,694000211,694000215,694000219,694000222,694000226,694000228,     694000232,694000235,694000236     N: 100     P: 0.1     L: 10 (min)     D: 26 (max)     [ 9, 18, 27, 36, 45, 54, 63, 72, 81, 90 ]     Found progression of 10 elements, difference: 16 starts: 694000045, ends: 694000189.      real    0m0.065s     user    0m0.052s     sys 0m0.004s

Asked By : Maciej Łopaciński

Answered By : D.W.

You recently clarified that you want to solve the following problem:

Given an input sequence $L[0..N-1]$, check whether an arithmetic progression of length $ge N/4$ appears as a subsequence of $L$.

This is a significantly easier problem, and I can suggest a very efficient algorithm for you. Let me build up to it by first explaining a few key ideas underlying the algorithm. Idea 1. Given any two elements of $L$, it is easy to extend them to the maximum-length arithmetic progression containing those two elements. In particular, suppose we have $x,y in L$ with $x<y$. Set $d=y-x$. Then we can extend this sequence on the right as the longest prefix of $y+d,y+2d,y+3d,dots$ that is wholly contained in $L$, and we can extend it on the left with the longest prefix of $x-d,x-2d,x-3d,dots$ that is wholly contained in $L$. Idea 2. Say that an arithmetic progression has gap $d$ if the difference between consequence elements of the arithmetic progression is $d$. If $L$ contains an arithmetic progression of length $ge N/4$, then its gap $d$ must satisfy $d le 4(L[N-1]-L[0])/N$. Idea 3. Suppose there is an arithmetic progression of length $ge N/4$, and let $alpha,beta$ denote the start and end index (in $L$) of the subsequence. Since the subsequence has to be of length $ge N/4$, the interval $[alpha,beta]$ must contain one or more of the following numbers: $0.2N$, $0.4N$, $0.6N$, $0.8N$ (rounding all of them to an integer). To give you some intuition for this: Imagine you’re playing Battleship on a $N times 1$ board, and your opponent has a single ship: a super-long battleship, that is $N/4$ cells long. Could you find his battleship? You sure could. It’s only gonna take you 4 shots to find his battleship. You can shoot at cells $0.2N$, $0.4N$, $0.6N$, $0.8N$, and it’s guaranteed that at least one of them will be a hit (that’s Idea 3). Moreover, once you get a single hit, it’s not going to be hard to sink the rest of his battleship (that’s basically Ideas 1 and 2). With these ideas, I’m ready to propose an algorithm:

Set $d_max gets 4(L[N-1]-L[0])/N$.
For each $i_0 in {0.2N, 0.4N, 0.6N, 0.8N}$, do:
- For each $i$ such that $ige i_0$ and $L[i] le L[i_0]+d_max$, do:
  - For each $j$ such that $j>i$ and $L[j] le L[i]+d_max$, do:
    - Extend the arithmetic progression $L[i],L[j]$ on the left and right as far as possible (using Idea 1).
Look at the longest arithmetic progression found at any point above. If it has length $ge N/4$, then yes, there exists an arithmetic progression of length $ge N/4$. If its length is $< N/4$, then no, there is no arithmetic progression of length $ge N/4$.

Hopefully you can see why this works. Idea 3 means that, if there is an arithmetic progression of length $ge N/4$, then at least one of the indices $i$ that are visited in the above algorithm must be the index of one element of that progression. Moreover, the index of the next element of that progression must be one of the values taken on by $j$ in the innermost loop. Therefore, at some point the pair $L[i],L[j]$ will be a pair of consecutive elements of the arithmetic progression — which is enough to discover that long arithmetic progression. I would expect that this algorithm will typically be very efficient, on real-world data. If all of the integers in $L$ are in the range $[1,M]$, then we know $d_max le 4M/N$. Moreover, the loop over $i$ typically involves about $d_max N/M le 4$ iterations (on average), and similarly for the loop over $j$. So, we expect to perform maybe $5 times 4 times 4 = 80$ iterations of the innermost statement, regardless of $N$, if the data is randomly chosen (and in particular, not adversarily chosen). Consequently, this should be extremely efficient if the data doesn’t conspire against you. It is possible to further improve this algorithm, for instance, to improve its worst-case running time and reduce the opportunity for worst-case data to cause the running time to blow up, at the cost of making the code more complicated. However, I suspect that this simple algorithm might be sufficient for your needs.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/18951