Google's scaling strategy was integral to its search dominance and its subsequent emergence as a major player in the net-based software era. Urs Hölzle spoke about this at EclipseCon 2005.
Google's infrastructure is unusual: economical, commodity hardware running Linux, not high-end servers. This lowers the cost per CPU and allows for greater redundancy, but it also increases the maintenance burden. In a setup this large, hardware failures are expected at a steady rate, so responding to them efficiently is far from trivial. Redundancy is built in everywhere; since not losing data is central to Google's business, this makes sense. Google has built scalable, reliable infrastructure building blocks, and the key is to make these racks of hardware work together while ensuring that the failure of any one machine doesn't derail an operation.
As Benjamin Booth points out, to achieve this Google built several useful abstractions:
Google File System (GFS)
• A GFS Master manages file metadata, which is itself replicated
• Files are split into 64 MB 'chunks' managed by Chunkservers; each chunk is replicated 3x for fault tolerance
• GFS clients ask the Master for chunk locations, then read and write data directly from the Chunkservers
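The bullets above can be sketched as a toy in-memory master. This is a hypothetical illustration, not Google's implementation: the class name, chunkserver IDs, and file names are invented, and the real GFS handles leases, versioning, and re-replication on failure.

```python
import random

REPLICATION_FACTOR = 3            # GFS triplicates each chunk
CHUNK_SIZE = 64 * 1024 * 1024     # 64 MB chunks

class ToyGFSMaster:
    """Hypothetical master: maps file names to chunk handles and
    chunk handles to the chunkservers holding the replicas."""

    def __init__(self, chunkservers):
        self.chunkservers = list(chunkservers)
        self.file_chunks = {}      # filename -> [chunk handles]
        self.chunk_locations = {}  # chunk handle -> [chunkserver ids]
        self._next_handle = 0

    def create_file(self, name, size_bytes):
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        handles = []
        for _ in range(n_chunks):
            handle = self._next_handle
            self._next_handle += 1
            # Place each replica on a distinct chunkserver
            self.chunk_locations[handle] = random.sample(
                self.chunkservers, REPLICATION_FACTOR)
            handles.append(handle)
        self.file_chunks[name] = handles

    def locate(self, name):
        """Clients ask the master only for locations, then fetch the
        data itself directly from the chunkservers."""
        return {h: self.chunk_locations[h] for h in self.file_chunks[name]}

master = ToyGFSMaster(chunkservers=["cs%d" % i for i in range(10)])
master.create_file("/logs/crawl.dat", 200 * 1024 * 1024)  # 200 MB -> 4 chunks
locations = master.locate("/logs/crawl.dat")
```

Note the separation of concerns this models: the master stays small (metadata only), so bulk data traffic never flows through it.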
Basic Computing Cluster
• Google needed massive parallelization and distribution that is easy to use
• MapReduce solves this problem. A MapReduce job = a map phase + a reduce phase.
Map: take input key/value pairs and produce a set of intermediate key/value pairs
Reduce: merge the intermediate values for each key and emit final, condensed key/value pairs - for search, these become the sorted, merged results
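The map and reduce phases above can be shown with the classic word-count example. This is a minimal single-process sketch (the function names and sample documents are invented); a real MapReduce run distributes the map and reduce tasks across many workers.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: input k/v (doc_id, text) -> intermediate (word, 1) pairs."""
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: merge all intermediate values for one key."""
    return (word, sum(counts))

def run_job(documents):
    # Shuffle step: group intermediate pairs by key
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    # Reduce step: one call per distinct key, output sorted by key
    return dict(sorted(reduce_phase(k, v) for k, v in grouped.items()))

result = run_job({"d1": "the cat sat", "d2": "the cat ran"})
# result == {"cat": 2, "ran": 1, "sat": 1, "the": 2}
```

Because each map call and each reduce call is independent, the framework can run them anywhere and re-run them freely, which is what makes the fault tolerance described below possible.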
MapReduce is so redundant that, in one unplanned test, Google lost 90% of a job's reduce 'worker' servers and all of the reduce tasks still completed - good fault tolerance in practice! Google's scaling approach is highly effective. How Google Works provides more first-hand information.
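The mechanism behind surviving a 90% worker loss is simple: tasks are idempotent, so any task on a dead worker is just re-queued and run elsewhere. A hedged sketch of that scheduling idea (the function and parameter names are invented, and real schedulers also handle stragglers and partial output):

```python
import random

def run_with_failures(tasks, n_workers, failure_rate, rng):
    """Re-queue any task whose worker fails; no work is ever lost,
    it is only retried, so the job completes despite heavy failures."""
    pending = list(tasks)
    completed = []
    while pending:
        task = pending.pop()
        worker = rng.randrange(n_workers)  # pick any available worker
        if rng.random() < failure_rate:
            pending.append(task)           # worker died: retry the task
        else:
            completed.append(task)
    return completed

rng = random.Random(0)  # fixed seed so the simulation is repeatable
done = run_with_failures(list(range(100)), n_workers=50,
                         failure_rate=0.9, rng=rng)
```

Even with 90% of attempts failing, every task eventually lands on a surviving worker, which mirrors the unplanned test described above.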