Asked By: LuckyB
Answered By: Evil
Superlinear speedup can come from a vast number of places: cache usage (what fits into registers, cache, main memory, or mass storage; more processing units very often means more registers and cache per subtask), memory access patterns, a better (or slightly different) algorithm, or flaws in the serial code.
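The cache effect can be illustrated with a toy cost model (my own sketch, not from the answer): assume a per-processor cache holding 1000 elements, with each element costing 1 time unit when the working set fits in cache and 10 units otherwise. Splitting the data across processors can then make each chunk cache-resident and yield a speedup larger than $p$:

```python
# Toy cost model for cache-driven superlinear speedup.
# Assumptions (illustrative, not measured): a per-processor cache
# holds CACHE elements; per-element cost is FAST when the working
# set fits in cache and SLOW otherwise.
CACHE = 1000          # elements that fit in one processor's cache
FAST, SLOW = 1, 10    # per-element cost in vs. out of cache

def run_time(working_set):
    """Modeled time to process `working_set` elements on one processor."""
    cost = FAST if working_set <= CACHE else SLOW
    return working_set * cost

N, p = 8000, 16
t_serial = run_time(N)          # whole set misses cache: 8000 * 10
t_parallel = run_time(N // p)   # each chunk of 500 fits:   500 * 1
speedup = t_serial / t_parallel
print(speedup)                  # 160.0, far above p = 16
```

The point of the model is only that the per-element cost changes once each processor's share of the data fits in its cache; real machines show smaller but genuinely superlinear effects of the same shape.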
For example, a randomized search of a solution space can be divided among $1024$ searchers that cover more of the space at once, so finding the solution sooner becomes more probable.
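A minimal sketch of this idea (my own example, with a deterministic scan standing in for the random search): $p$ searchers scan disjoint chunks of the space in parallel, so the searcher that owns the target finds it after at most $\text{size}/p$ probes instead of up to $\text{size}$ probes.

```python
# Sketch: partitioned search over disjoint chunks (illustrative
# linear scan; a randomized search behaves similarly in expectation).
def serial_steps(target, size):
    """Probes a single linear scan needs to reach `target`."""
    return target + 1

def parallel_steps(target, size, p):
    """Each of p searchers scans its own chunk of size // p elements;
    the searcher owning `target` finds it after target % chunk + 1 probes."""
    chunk = size // p
    return target % chunk + 1

size, p, target = 10000, 16, 7777
print(serial_steps(target, size))       # 7778 probes
print(parallel_steps(target, size, p))  # 278 probes
```

Note this per-searcher probe count can drop by more than a factor of $p$ for a lucky target position, which is exactly the statistical flavor of superlinear speedup in search problems; the total work across all searchers is not reduced.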
There are also byproducts: if you swap elements as in bubble sort and switch to a GPU, it swaps all eligible pairs at once, while the serial version swaps only up to a point. On a distributed system, communication is even more costly, so programs are restructured to keep memory usage local (which also changes memory access and divides the problem differently than a sequential application does). Most importantly, the sequential program is not exactly the same as the parallel version: different technology, environment, algorithm, and so on, which makes the two hard to compare.

The following excerpt is from "Introduction to Parallel Computing", second edition, by Ananth Grama et al., 2003:
Theoretically speedup can never exceed the number of processing elements $p$.
If the best sequential algorithm takes $T_s$ units of time to solve a given problem on a single processing element, then a speedup of $p$ can be obtained on $p$ processing elements if none of them spends more than time $T_s/p$.
A speedup greater than $p$ is possible only if each processing element spends less than time $T_s/p$ solving the problem.
In this case, a single processing element could emulate the $p$ processing elements and solve the problem in fewer than $T_s$ units of time.
This is a contradiction because speedup, by definition, is computed with respect to the best sequential algorithm.
So the name "superlinear" in this context comes from the definition of speedup.
Question Source: http://cs.stackexchange.com/questions/55433