In the previous chapter, Graph Relationships and Search, we learned how to map connections between data, like finding the shortest path between two friends.
But what if we aren't just looking for one path? What if we need to analyze everything?
Imagine you have a stack of 1,000,000 receipts from a grocery store, and you need to calculate the total sales for "Bananas."
If you do this alone (one computer), it might take a week. But if you hire 1,000 friends (machines) and give each friend 1,000 receipts, you can finish in minutes.
Distributed Data Processing is the art of breaking a massive task into tiny pieces, processing them in parallel on many machines, and combining the results.
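The receipt-counting idea can be sketched in a few lines of Python. This is just an in-memory illustration with made-up numbers: the data is split into chunks, and each chunk is handed to a separate worker (here, threads stand in for the 1,000 machines).

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend each receipt records how many bananas were sold (made-up data)
receipts = [1, 3, 0, 2] * 250_000  # 1,000,000 receipts

def subtotal(chunk):
    # One "friend" adds up their own stack of receipts
    return sum(chunk)

# Split the pile into 1,000 stacks of 1,000 receipts each
chunks = [receipts[i:i + 1000] for i in range(0, len(receipts), 1000)]

# Each worker computes a subtotal; we combine the subtotals at the end
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(subtotal, chunks))

print(total)  # same answer as sum(receipts), computed in parallel
```

The key property is that the subtotals can be computed in any order, on any machine, and combined at the end. MapReduce is this idea industrialized.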
Let's look back at our Pastebin example from System Architecture Design.
Every time someone visits a paste (like pastebin.com/AbC12), we save a log entry:
2023-10-27 10:00:01, /AbC12
2023-10-27 10:00:02, /XyZ99
2023-10-27 10:00:03, /AbC12
... (1 Billion more lines) ...
Goal: "How many times was /AbC12 visited in October?"
A 1-billion-line file is too large to process quickly on one machine (it may not even fit on one disk). We need a strategy called MapReduce.
MapReduce is a programming model Google made famous. It works like a kitchen brigade.
Step 1: Map. Imagine we give a page of logs to a worker. The worker's only job is to look at each line and say, "I saw /AbC12 one time."
Input: a line of text.
Output: a key-value pair, (URL, 1).
Step 2: Shuffle. The system groups all the results by key. All the /AbC12 notes go into one pile; all the /XyZ99 notes go into another.
Step 3: Reduce. A different worker takes the /AbC12 pile. They don't need to read the original logs; they just count the notes: "1, 1, 1... okay, that's 3 total."
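The whole brigade can be simulated in plain Python on a handful of log lines. This is only an in-memory sketch of the Map, Shuffle, and Reduce phases; a real cluster performs the same grouping across many machines.

```python
from collections import defaultdict

log_lines = [
    "2023-10-27 10:00:01, /AbC12",
    "2023-10-27 10:00:02, /XyZ99",
    "2023-10-27 10:00:03, /AbC12",
]

# Map: emit (url, 1) for every line
mapped = []
for line in log_lines:
    url = line.split(',')[1].strip()
    mapped.append((url, 1))

# Shuffle: group all values by key (one "pile" per URL)
piles = defaultdict(list)
for url, count in mapped:
    piles[url].append(count)

# Reduce: sum each pile
totals = {url: sum(counts) for url, counts in piles.items()}
print(totals)  # {'/AbC12': 2, '/XyZ99': 1}
```

Note that each phase only ever looks at one line, one pair, or one pile at a time, which is exactly what makes the work easy to spread across machines.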
Here is how data flows through the system:
Raw logs → Map (emit (URL, 1)) → Shuffle (group by URL) → Reduce (sum) → Final counts
We will use a Python library called mrjob, which hides the messy parts (like sending data over the network between machines) and lets us focus on the logic.
The Mapper looks at one line at a time. It extracts the data we care about (the URL) and "yields" a count of 1.
```python
def mapper(self, _, line):
    # Line looks like: "2023-10-27 10:00:01, /AbC12"
    data = line.split(',')
    url = data[1].strip()
    # We output: (Key, Value)
    yield url, 1
```
Explanation: We don't try to add anything up yet. We just shout, "I found one /AbC12!"
The Reducer receives a key (the URL) and a list of all the "1"s found by all the mappers.
```python
def reducer(self, key, values):
    # key = "/AbC12"
    # values = [1, 1, 1, 1, ...]
    total_hits = sum(values)
    yield key, total_hits
```
Explanation: The sum() function adds up the list. If 500 different workers each found the URL once, the list has 500 ones. The sum is 500.
Here is the complete code structure (based on solutions/system_design/pastebin/pastebin.py).
```python
from mrjob.job import MRJob


class HitCounts(MRJob):

    def mapper(self, _, line):
        # extract_url is a helper method to parse the text
        url = self.extract_url(line)
        yield url, 1

    def extract_url(self, line):
        # Pull "/AbC12" out of "2023-10-27 10:00:01, /AbC12"
        return line.split(',')[1].strip()

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    HitCounts.run()
```
Explanation: This script can be run on your laptop for small files, or sent to a cluster (like Amazon EMR) to run on 1,000 servers automatically.
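One optimization worth knowing about before moving on: a combiner, which pre-sums counts on each mapper's machine before anything crosses the network (mrjob supports this via a `combiner` method with the same signature as the reducer). The sketch below simulates the idea in plain Python with hypothetical mapper outputs.

```python
from collections import Counter

# Each inner list is the output of one mapper (hypothetical data)
mapper_outputs = [
    [("/AbC12", 1), ("/AbC12", 1), ("/XyZ99", 1)],
    [("/AbC12", 1), ("/AbC12", 1)],
]

# Without a combiner, every single pair crosses the network
sent_without = sum(len(out) for out in mapper_outputs)  # 5 pairs

# With a combiner, each mapper pre-sums its own pairs first
combined = []
for out in mapper_outputs:
    local = Counter()
    for url, count in out:
        local[url] += count
    combined.extend(local.items())

sent_with = len(combined)  # only 3 pairs cross the network

# The final reduce produces the same totals either way
final = Counter()
for url, count in combined:
    final[url] += count
print(dict(final))  # {'/AbC12': 4, '/XyZ99': 1}
```

For a count job like ours, the savings are dramatic: a mapper that saw /AbC12 a million times sends one pair instead of a million.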
You might wonder: "Who coordinates all these workers?"
This architecture requires a Master Node (the manager): it splits the input, assigns chunks to workers, tracks their progress, and reassigns a chunk to another machine if a worker fails.
What if we want to sort items? For example, finding the "Best Selling Product." This requires Chaining jobs.
Here is how a simplified SalesRanker works (from solutions/system_design/sales_rank/sales_rank_mapreduce.py):
```python
def steps(self):
    # We tell the system to run two steps in order
    return [
        self.mr(mapper=self.mapper_sum,
                reducer=self.reducer_sum),
        self.mr(mapper=self.mapper_sort,
                reducer=self.reducer_identity),
    ]
```
Explanation: The output of step 1 becomes the input of step 2. This is how we build complex data pipelines.
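The two-step pipeline can be sketched in plain Python on hypothetical sales data: step 1 sums the sales per product, and step 2 re-keys each product by its total so that sorting the keys ranks the products.

```python
from collections import defaultdict

# Hypothetical (product_id, quantity) sale events
sales = [("apple", 3), ("banana", 5), ("apple", 2), ("cherry", 1)]

# Step 1: sum sales per product (the mapper_sum / reducer_sum pass)
totals = defaultdict(int)
for product_id, qty in sales:
    totals[product_id] += qty

# Step 2: make the total the key, then sort by key (the mapper_sort pass)
ranked = sorted((total, product_id) for product_id, total in totals.items())

print(ranked)  # [(1, 'cherry'), (5, 'apple'), (5, 'banana')]
```

The `sorted()` call here plays the role the framework plays on a cluster: MapReduce sorts keys between the map and reduce phases for free.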
MapReduce sorts keys automatically between Map and Reduce. If we want to sort by sales numbers, we make the "Number of Sales" the key.
```python
def mapper_sort(self, product_id, total_sales):
    # Make the sales count the KEY
    # Format: (Sales Count, Product Name)
    yield total_sales, product_id
```
Explanation: Because the system sorts keys between the map and reduce phases, products arrive at the reducer ordered by sales count (e.g., 10 sales, then 20 sales, then 100 sales). One practical wrinkle: keys travel between machines as text and are compared as strings, so numeric keys are typically zero-padded (e.g., 0000000010) to make the string order match the numeric order.
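The string-sorting pitfall is easy to demonstrate with made-up counts: sorted as text, "100" comes before "20". Zero-padding every key to a fixed width restores the numeric order.

```python
# Naive string sort: "100" comes before "20"
counts = ["10", "100", "20"]
print(sorted(counts))  # ['10', '100', '20']

# Zero-pad every count to a fixed width so string order == numeric order
padded = [c.zfill(10) for c in counts]
print(sorted(padded))  # ['0000000010', '0000000020', '0000000100']

# Stripping the padding recovers the counts in true numeric order
in_order = [p.lstrip('0') for p in sorted(padded)]
print(in_order)  # ['10', '20', '100']
```

This is why real MapReduce sort jobs usually emit keys like 0000000042 rather than raw numbers.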
In this tutorial series, we have traveled from the basics to advanced scale.
Congratulations! You now understand the fundamental building blocks used by software architects to design systems like Google, Amazon, and Facebook.
While specific tools (like Kubernetes, Kafka, or React) will change over time, these core concepts—Architecture, Data Modeling, Caching, Graph Theory, and Distributed Processing—remain the foundation of modern software engineering.