Log of billions of queries. Count frequency of each query
Input query log:
cat, time, userid1, ip1, referrer1
dog, time, userid2, ip2, referrer2
...
Output query frequency:
cat 200000
dog 120000
...
Q: How can we perform this task of counting? How can we parallelize it?
Step 1: "Transform" each line of query log into (query, 1)
Step 2: Collect all tuples with the same query and aggregate them
How can we parallelize those steps tho?
The transformation of each line can be done independently
Step 1: parallel processing
Step 2: Aggregatin