In the first chapter Open Source data structure book Pat Morin mentions that
Now consider a company like Google, that indexes over 8.5 billion web
pages. By our calculations, doing any kind of query over this data would take at least 8.5
seconds. We already know that this isn’t the case; web searches complete in much less than
8.5 seconds, and they do much more complicated queries than just asking if a particular
page is in their list of indexed pages. At the time of writing, Google receives approximately
4;500 queries per second, meaning that they would require at least 4;500 8:5 = 38;250
very fast servers just to keep up
The above is possibly based on two strategies
- If a problem can be solved by a computer in X seconds and if there are Y such problems. To retain the performance (i.e. responding with a solution in X seconds) a strategy to employ Y computers should be theoretically possible. (Ignoring the latency in assigning the problem to a computer).
- If a problem can be solved by computer in X seconds and if its distributable i.e. if we can apply divide-and-conquer approach to solve the problem then the problem can possibly be solved in 1 second by employing X computers. (Ignoring the overhead of merging local solutions to solve the complete problem)
- Applying both the above if a computer takes X seconds and if there are Y such problems and if you want to solve the problem in 1 second, it can be done using X*Y computers.
The statement in the above source is inline with Amdahl’s law which is used to find the maximum expected improvement to an overall system when only part of the system is improved.
When applied to problems which can be parallelized it
if P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 −P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is
In the limit, as N tends to infinity, the maximum speedup tends to 1 / (1 − P). In practice, performance to price ratio falls rapidly as N is increased once there is even a small component of (1 − P).
if for a given problem size a parallelized implementation of an algorithm can run 12% of the algorithm’s operations arbitrarily quickly (while the remaining 88% of the operations are not parallelizable), Amdahl’s law states that the maximum speedup of the parallelized version is 1/(1 – 0.12) = 1.136 times as fast as the non-parallelized implementation.
Further Optimization Strategy
- Share Knowledge: While each computer solves its piece of the puzzle (intentionally left it loosely defined since it could be subset of the problem or a solution of the problem on a local subset or some other form) it could be that it generates several statistics or makes several observations that it can share with other computers so that they can optimize their operations on the go.
A search problem for a a particular class such that all its instances abide by a set of conditions. If one of the computers – X finds that all instances of class A abide by the rule then it will start broadcasting Class A to other computers and other computers can quickly tell X if they have seen an exception instance and therefore help X ignore all new instances of A completely. Had this information not been shared then each computer would have continued to apply conditions on each instance of Class A until atleast they found an exception and in worst case it would only be found at the time or merging local search results.
- This (and other such possible strategies) are not in contradiction with Amdahl’s law in fact what they are saying is that the speedup due to parallelization need not be linear function.
- As this blog suggests the Amdahl law’s formula per se, has not been proposed by Amdahl himself and there are oversimplifications assumed by the formula which thus reduces its applicability to various real-life application use-cases including parallel-processing frameworks like hadoop. Even then it does seem to be a good point to start from.
- Also achieving linear scale in parallelizable tasks is considered to be tough problem in general given the inevitable overheads in breaking and sending the tasks and constantly monitoring, merging and formulating final results. The above optimization strategy doesn’t belittle this problem, instead suggests a possible way to spin it around to offset those overheads. (or possibly go beyond???)