[Solved]: Largest reported speedup for a parallel computation?

Problem Detail: A parallel implementation of a computation is usually compared to a sequential version of the same computation. The ratio of the time taken by the sequential version to the time taken by the parallel version is called the speedup. So if 8 cores run their smaller parts of the computation in 2 time units, and one core runs the whole computation in 8 time units, then the speedup is 4.
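For concreteness, here is a minimal sketch of that definition in Python, using the hypothetical numbers from the example above:

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup = time of the sequential run divided by time of the parallel run."""
    return t_sequential / t_parallel

# Example from above: one core takes 8 time units, 8 cores finish in 2 time units.
print(speedup(t_sequential=8.0, t_parallel=2.0))  # 4.0
```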

What is the largest speedup reported for a real computation?

It is possible to reach essentially infinite speedup in a search problem, since one of the parallel pieces of the search space may lead to a fast solution by that parallel instance, while the sequential solution has to work through the entire search space to get to that point. More generally, I want to exclude any problem where one of the parallel processes can reach a shortcut. So I am only interested in computations where the amount of work done by the parallel processes is the same as that done by the sequential process. This is common in solving PDEs by grid methods, or in discrete event simulation. With $n$ processors, one should therefore never get more than $n$-fold speedup for these kinds of problems.

I would also like to exclude embarrassingly parallel problems like parallel rendering, since there one really has a vast number of independent tiny problems. I am interested in problems where it is not possible to partition the computation into strictly disjoint pieces. For a large speedup, one has to have many processors. Given the restrictions on scope that I have conveniently labelled as “real computations”, this question is then essentially about how efficiently the very large processor arrays that exist have been programmed. I am aware of reported speedups of ~500 using arrays of GPUs, but surely larger speedups exist.

Edit: To address some of the comments, I dug up some further motivation, which will hopefully be precise enough for the tastes of those more mathematically inclined. This is quite different in style from the above, so I append it here as a postscript.

For $n$ iid random variables $X_1, X_2, \dots, X_n$ with mean $\mu$, denote their maximum by $X_{(n)}$ and their sum by $S_n$. O’Brien has shown that $X_{(n)}/S_n \to 0$ almost surely as $n \to \infty$ iff $\mu < \infty$. Letting $X_i$ be the time taken by the $i$-th processor to complete its task, and assuming that there is a timeout/recompute mechanism to ensure that $\mu$ is finite, this hints that the speedup $S_n/X_{(n)}$ (the inverse of this ratio) should be essentially unbounded. (This is not necessarily the case: the techniques used may not carry over to the inverse, so I have to leave this a bit vague.) This is a nice theoretical prediction, and the question arises: is this prediction borne out in practice? Or do implicit dependencies or diverging behaviours of the different processors (i.e. a breakdown in the iid assumption) tend to curtail this supposedly unbounded increase?

The iid case corresponds to the embarrassingly parallel case. Where the processors have to synchronize, independence breaks down. My question can therefore also be rephrased as: how badly does non-negligible dependence between parts of a computation, as seen in practice, affect the large-scale speedups that have been demonstrated? Given that the bounds on the expectation of the maximum in the non-independent case are quite weak, some pointers to empirical data would be useful.
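As a purely illustrative sketch (not part of the original question), the iid prediction can be simulated: drawing $n$ independent task times, the speedup $S_n/X_{(n)}$ keeps growing with $n$, while the efficiency drifts down only slowly. The exponential distribution below is an arbitrary choice for illustration.

```python
import random

# Sketch of the iid setting: parallel time is the slowest processor X_(n),
# sequential time is the sum S_n, so speedup = S_n / X_(n).
# O'Brien's result (X_(n)/S_n -> 0 a.s. when the mean is finite) suggests
# this speedup grows without bound as n increases.
random.seed(1)

for n in (10, 100, 1_000, 10_000, 100_000):
    times = [random.expovariate(1.0) for _ in range(n)]  # hypothetical task times
    s_n = sum(times)    # sequential time
    x_max = max(times)  # parallel time (the maximum)
    sp = s_n / x_max
    print(f"n={n:>7}  speedup={sp:9.1f}  efficiency={sp / n:.2f}")
```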

Asked By : András Salamon

Answered By : Massimo Cafaro

John Gustafson observed and reported speedups of around 1000 on a 1,024-processor machine in the late 1980s; this led him to the concept of scaled speedup (Gustafson-Barsis law), in contrast to the pessimistic Amdahl-Ware law.

Right now, in the era of multicore parallel supercomputers equipped with hundreds of thousands or millions of cores, performance is more commonly reported in terms of petaflops and efficiency. For instance, the winners of the last Gordon Bell prize at the Supercomputing 2012 conference established the current world record for an astrophysical N-body simulation of one trillion particles, performed on the full K computer, which appears at number 4 in the June 2013 Top500 list. They reported 4.45 petaflops on 663,552 cores (the K computer was equipped with 82,944 8-core processors in November 2012) and an efficiency (the ratio speedup/cores) of 42%. Translated into a speedup, that means a speedup of about 278,691.
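The arithmetic behind that last figure can be checked directly; a small sketch using only the numbers quoted above:

```python
# Numbers quoted in the answer: 82,944 processors with 8 cores each,
# and a reported efficiency (speedup / cores) of 42%.
cores = 82_944 * 8          # 663,552 cores
efficiency = 0.42
speedup = efficiency * cores
print(f"{cores:,} cores -> speedup of about {int(speedup):,}")
# 663,552 cores -> speedup of about 278,691
```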

Question Source : http://cs.stackexchange.com/questions/12826