Mass Data Processing Technology on Large Scale Clusters
Summer 2007, Tsinghua University

All course material (slides, labs, etc.) is licensed under the Creative Commons Attribution License.
Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their original version.
Some slides from: Jeff Dean, Sanjay Ghemawat, http://labs./papers/

Motivation
• 200+ processors
• 200+ terabyte database
• 10^10 total clock cycles
• 0.1 second response time
• 5¢ average advertising revenue
From: /~bryant/presentations/DISC-

Motivation: Large Scale Data Processing
• Want to process lots of data (> 1 TB)
• Want to parallelize across hundreds/thousands of CPUs
• … Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."
From: http://googlesystem./2006/09/how-much-data-does-google-

MapReduce
• Automatic parallelization & distribution
• Fault-tolerant
• Provides status and monitoring tools
• Clean abstraction for programmers

Programming Model
• Borrows from functional programming
• Users implement interface of two functions:
  • map (in_key, in_value) -> (out_key, intermediate_value) list
  • reduce (out_key, intermediate_value list) -> out_value list

map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key*value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.

reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list.
• reduce() combines those intermediate values into one or more final values for that same output key.
• (In practice, usually only one final value per key.)

Architecture
[architecture diagram]

Parallelism
• map() functions run in parallel, creating different intermediate values from different input data sets.
• reduce() functions also run in parallel, each working on a different output key.
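The two-function interface described above can be sketched with the classic word-count example. This is an illustrative sketch, not Google's implementation; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are invented for this example, and the "shuffle" step that groups intermediate values by key is shown sequentially for clarity.

```python
# Word count expressed with the map/reduce interface from the slides.
# Illustrative sketch only — names and the sequential driver are assumptions.
from collections import defaultdict

def map_fn(in_key, in_value):
    # in_key: filename (unused here); in_value: one line of text.
    # Emits a list of (out_key, intermediate_value) pairs.
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    # Combines all intermediate values for one output key into final value(s).
    return [sum(intermediate_values)]

def run_mapreduce(records, map_fn, reduce_fn):
    # Sequential stand-in for the map phase, the grouping ("shuffle"),
    # and the reduce phase.
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, v in map_fn(in_key, in_value):
            groups[out_key].append(v)   # group intermediate values by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

result = run_mapreduce(
    [("doc1", "the quick the"), ("doc2", "quick fox")],
    map_fn, reduce_fn)
# e.g. result == {"the": [2], "quick": [2], "fox": [1]}
```

Note that reduce_fn returns a list of final values, matching the `-> out_value list` signature on the Programming Model slide, even though (as the reduce slide notes) there is usually only one final value per key.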
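The parallelism claim — independent map tasks per input record, independent reduce tasks per output key — can be sketched with a thread pool. This is a toy single-machine analogue under assumed names (`map_fn`, `reduce_fn`), not the distributed architecture the slides describe:

```python
# Sketch of map/reduce parallelism on one machine using a thread pool.
# Each map task handles one input record; each reduce task handles one key.
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

def map_fn(filename, line):
    return [(w, 1) for w in line.split()]

def reduce_fn(word, counts):
    return sum(counts)

records = [("a", "hello world"), ("b", "hello mapreduce")]

with ThreadPoolExecutor() as pool:
    # Map phase: records processed by independent tasks in parallel.
    mapped = pool.map(lambda r: map_fn(*r), records)
    groups = defaultdict(list)
    for pairs in mapped:
        for k, v in pairs:
            groups[k].append(v)        # shuffle: group by output key
    # Reduce phase: one independent task per output key.
    reduced = dict(pool.map(lambda kv: (kv[0], reduce_fn(*kv)),
                            groups.items()))
# reduced == {"hello": 2, "world": 1, "mapreduce": 1}
```

Because map tasks share nothing and each reduce task owns exactly one key, neither phase needs locks — which is exactly what makes the model easy to distribute across thousands of CPUs.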