Distributed Analytics Using a Cluster

Often, our data is non-relational and huge
- Billions of query logs
- Billions of web pages
- ...
Q: Can we perform analytics on large data quickly using thousands of machines?

Example: a log of billions of queries. Count the frequency of each query
- Input query log:
  
  cat, time, userid1, ip1, referrer1
  
  dog, time, userid2, ip2, referrer2
  
  ...
- Output query frequency:
  
  cat 200000
  
  dog 120000
  
  ...
Q: How can we perform this task of counting? How can we parallelize it?

Step 1: "Transform" each line of query log into (query, 1)

Step 2: Collect all tuples with the same query and aggregate them
How can we parallelize these steps?

The transformation of each line can be done independently

Step 1: Parallel processing
- Split input data into multiple independent chunks
- Move each chunk to a separate machine
- Perform "transformation" on multiple machines in parallel
Step 2: Aggregation
- Move the tuples with the same query to the same machine
- Perform aggregation on multiple machines in parallel
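
The two steps above can be sketched on a single machine as follows. This is a minimal illustration, not a distributed implementation: the chunks stand in for separate machines, and the names `map_line`, `split_into_chunks`, and `count_queries`, as well as the third sample log line, are hypothetical and not part of the notes.

```python
from collections import defaultdict

def map_line(line):
    """Step 1: turn one log line into a (query, 1) tuple."""
    query = line.split(",")[0].strip()
    return (query, 1)

def split_into_chunks(lines, num_chunks):
    """Split the input into independent chunks (one per simulated machine)."""
    return [lines[i::num_chunks] for i in range(num_chunks)]

def count_queries(lines, num_chunks=2):
    # Step 1: every chunk is mapped independently of the others,
    # so on a real cluster each chunk could go to a different machine.
    mapped_chunks = [
        [map_line(line) for line in chunk]
        for chunk in split_into_chunks(lines, num_chunks)
    ]
    # Step 2: bring tuples with the same query together, then sum them.
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for query, one in chunk:
            groups[query].append(one)
    return {query: sum(ones) for query, ones in groups.items()}

# Sample lines in the format of the input query log (assumed data).
log = [
    "cat, time, userid1, ip1, referrer1",
    "dog, time, userid2, ip2, referrer2",
    "cat, time, userid3, ip3, referrer3",
]
counts = count_queries(log)
print(counts)
```

Because `map_line` looks at one line only, the chunks can be processed in any order or at the same time without changing the result.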

Example: 1 billion pages. Build an "inverted index"
- Input documents:
  - 1: cat chase dogs
  - 2: dogs loves cat
  - ...
- Output index:
  - cat 1,2,5,10,20...
  - dog 1,2,3,8,9...
Step 1: "Transform" every document into (word, doc_id) tuples
Step 2: Collect all tuples with the same word and "aggregate" (or concatenate) the doc_id's
How can we parallelize the 2 steps on multiple machines?
- For the first step, we can distribute the documents across different machines and perform the "transformation" on each machine
- For the second step, we can move all tuples with the same word to the same machine and perform aggregation on multiple machines in parallel
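
As a rough sketch of the same two steps for indexing, again on one machine: the function names `map_document` and `build_index` are hypothetical. The sketch splits on whitespace and does no word normalization, so it indexes "dogs" rather than "dog"; the output shown in the notes suggests some normalization step that is not described and is not attempted here.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Step 1: emit one (word, doc_id) tuple per distinct word in the document."""
    return [(word, doc_id) for word in sorted(set(text.split()))]

def build_index(documents):
    # Step 1: each document is transformed independently of the others.
    tuples = []
    for doc_id, text in documents.items():
        tuples.extend(map_document(doc_id, text))
    # Step 2: collect tuples with the same word and concatenate the doc_ids.
    postings = defaultdict(list)
    for word, doc_id in tuples:
        postings[word].append(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

# The two sample documents from the notes.
docs = {1: "cat chase dogs", 2: "dogs loves cat"}
index = build_index(docs)
print(index)
```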

Generalization

"Mapping Step": Input data consists of multiple independent units
- Ex: Each line of query log, each web page
- Partition input data into multiple "chunks" and distribute them to multiple machines
- Transform (map) each input unit into (key, value) tuples
  - Query log: query_log_line → (query, 1)
  - Indexing: web_page → (word1, page_id), ...
"Reduction Step": Aggregate the tuples with the same key
- Ex:
  - Query log: (query, 1), (query, 1), ... → (query, count)
  - Indexing: (word, 1), (word, 3), ... → (word, [1, 3, ...])
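
Between the two steps, tuples with the same key have to end up on the same machine. The notes do not say how; one common approach, assumed here, is to hash the key and take it modulo the number of reducer machines. The names `partition` and `shuffle` are hypothetical, and `zlib.crc32` is used only because it gives the same value on every machine, unlike Python's built-in `hash` for strings, which is randomized per process.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer for a key; equal keys always map to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(tuples, num_reducers):
    """Route each (key, value) tuple to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in tuples:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

# Output of the mapping step for a small query log (assumed data).
tuples = [("cat", 1), ("dog", 1), ("cat", 1), ("dog", 1), ("cat", 1)]
buckets = shuffle(tuples, num_reducers=3)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, bucket)
```

Each bucket can then be aggregated on its own machine, since no key is split across buckets.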

Programmer provides
1. Map function: "unit data" → (k', v'), (k'', v'')
2. Reduce function: (k, v1), (k, v2), ... → (k, aggr(v1, v2, ...))
MapReduce handles the rest
- Automatic data partition, distribution, and collection
- Handling of machine failures and speed disparities between machines
Many systems support the MapReduce model
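
To make the division of labor concrete, here is a toy, single-process sketch of the model: the programmer writes only a map function and a reduce function, and a generic driver does the grouping. The driver `map_reduce` and the function names are hypothetical and do not correspond to the API of any real MapReduce system; partitioning across machines and failure handling are left out entirely.

```python
from collections import defaultdict

def map_reduce(units, map_fn, reduce_fn):
    """Toy driver: apply map_fn to every unit, group by key, apply reduce_fn."""
    groups = defaultdict(list)
    for unit in units:
        for key, value in map_fn(unit):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Programmer-provided functions for the query log example.
def query_map(line):
    return [(line.split(",")[0].strip(), 1)]

def query_reduce(query, ones):
    return (query, sum(ones))

# Programmer-provided functions for the indexing example.
def index_map(page):
    page_id, text = page
    return [(word, page_id) for word in set(text.split())]

def index_reduce(word, page_ids):
    return (word, sorted(page_ids))

log = [
    "cat, time, userid1, ip1, referrer1",
    "dog, time, userid2, ip2, referrer2",
    "cat, time, userid3, ip3, referrer3",
]
pages = [(1, "cat chase dogs"), (2, "dogs loves cat")]

query_counts = map_reduce(log, query_map, query_reduce)
page_index = map_reduce(pages, index_map, index_reduce)
print(query_counts)
print(page_index)
```

Both examples reuse the same driver unchanged; only the two small functions differ, which is the point of the model.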