Learn how to design a blob storage system like S3 in interviews — covering partitioning, metadata separation, replication, and the 10 mistakes that cost candidates the offer.
Here's the thing — blob storage is one of those system design questions that feels deceptively simple at first. "Just store some files, right?" And then the interviewer asks how you'd handle 10 billion objects across thousands of nodes, and suddenly you're staring at the whiteboard like a deer in headlights.
I've seen this exact moment happen in hundreds of interviews. Candidates who crushed LeetCode and could recite CAP theorem cold — completely froze when asked to design something like AWS S3 or Azure Blob Storage.
The good news? Once you understand the core patterns, blob storage becomes one of the most predictable system design interviews out there. The same concepts come up every single time. This guide is going to walk you through all of them, including the 10 mistakes I see most often and exactly how to avoid them.
Before we dive in, let me give you some insider context. When an interviewer asks you to design blob storage, they're not just checking if you know what S3 is. They're looking for specific signals:
A strong answer hits all four. A weak answer treats the system like a file server that just needs "a load balancer and a database."
Let's build the mental model before we go deep. At a high level, every blob storage system does five things:
That separation between metadata and data in step 3? That's the single most important concept in this entire interview. Write it down. We'll come back to it.
When a client uploads a blob, here's what happens:
```
Client → API Gateway → Load Balancer → Frontend Server
        ↓
Auth + Quota Check
        ↓
Metadata Service (assigns storage location)
        ↓
Data Node (blob is written)
        ↓
Metadata Updated (blobId → node location)
```

And when a client downloads a blob:

```
Client requests blob by key
        ↓
Metadata Service looks up location
        ↓
Frontend redirects or proxies to correct Data Node
        ↓
Blob is streamed back to client
```

When you explain this in an interview, walk through both flows explicitly. Interviewers love when candidates can narrate a complete request lifecycle — it shows you've actually thought about the system end-to-end.
Here's what the interviewer is really checking when they ask "how would you store billions of objects?": do you understand sharding?
Blob storage works by distributing data across many nodes. The partition key determines which node stores a given blob. The most common approach is consistent hashing:
```python
import hashlib

# Simplified partition key assignment
def get_partition(blob_key: str, num_partitions: int) -> int:
    hash_value = hashlib.md5(blob_key.encode()).hexdigest()
    return int(hash_value, 16) % num_partitions

# Usage
blob_key = f"{account_id}/{container_id}/{blob_id}"
partition = get_partition(blob_key, num_partitions=1024)
```

Here's a common trap: candidates confuse the blob ID with the partition key. The blob ID identifies the blob. The partition key locates it. These are different things. A blob might have the ID img_abc123, but it lives on partition 47 — and you find partition 47 by hashing the blob key, not by reading the ID.
Why does this matter? Sequential keys (like auto-incrementing integers) create hotspots — all new uploads pile onto the same partition. You fix this with hashing or by prefixing keys with a random salt.
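To make the salting fix concrete, here's a minimal sketch: prefixing a sequential ID with a few characters of its own hash so consecutive uploads no longer cluster on one partition. The helper name and prefix length are illustrative, not from any particular SDK.

```python
import hashlib

def salted_key(sequential_id: int, prefix_len: int = 4) -> str:
    """Prefix a sequential ID with a short hash so consecutive
    uploads spread across partitions instead of piling on one."""
    salt = hashlib.md5(str(sequential_id).encode()).hexdigest()[:prefix_len]
    return f"{salt}/{sequential_id}"
```

Keys `1001` and `1002` now start with unrelated prefixes, so a range- or prefix-partitioned store no longer sees every new write land at the tail.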
This is the architectural insight that separates intermediate candidates from senior ones.
| Property | Metadata Store | Data Store |
|---|---|---|
| Size | Small (bytes per record) | Large (MBs to GBs per blob) |
| Access Pattern | Random reads, fast lookup | Sequential reads, high throughput |
| Storage Type | Relational DB or KV store | Distributed object storage |
| Consistency Needs | Strong | Eventual is usually fine |
The metadata service stores things like: blob name, size, content type, storage location, version, checksum, and access controls. The data nodes store the actual binary content.
Why split them? Because a 10-byte metadata lookup and a 4GB video download have completely different performance characteristics. Treating them the same way is like storing your index cards in the same drawer as your TV.
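A metadata record covering the fields listed above can be sketched as a small dataclass — the exact schema and field names here are illustrative, not S3's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class BlobMetadata:
    """One record in the metadata store; tens of bytes per blob."""
    blob_name: str
    size_bytes: int
    content_type: str
    storage_nodes: list[str]   # replica locations, e.g. ["node-12", "node-47", "node-83"]
    version: int
    checksum: str              # e.g. hex digest of the blob's content
    acl: dict = field(default_factory=dict)  # access controls
```

Note how nothing here is the blob itself — the record is a pointer plus bookkeeping, which is exactly why it can live in a fast, strongly consistent KV store.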
This is non-negotiable in interviews. If you don't mention replication, the interviewer loses confidence immediately.
The standard approach: every blob is replicated 3 times, ideally across different failure domains (machines → racks → availability zones). This is how systems like S3 achieve "11 nines" of durability (99.999999999%).
When a blob is written, you typically need at least 2 out of 3 replicas to acknowledge before confirming success to the client. This is your write quorum.
Here's the thing most people miss — you can't upload a 5GB file as a single HTTP request. Networks fail. Connections drop. You need multipart upload:
```python
# Multipart upload flow (storage_client is a hypothetical SDK wrapper)
def multipart_upload(file_path: str, blob_key: str, chunk_size_mb: int = 8):
    # Step 1: Initiate upload, get upload_id
    upload_id = storage_client.initiate_multipart_upload(blob_key)
    parts = []
    chunk_size = chunk_size_mb * 1024 * 1024
    with open(file_path, 'rb') as f:
        part_number = 1
        while chunk := f.read(chunk_size):
            # Step 2: Upload each part independently (retryable)
            etag = storage_client.upload_part(
                blob_key, upload_id, part_number, chunk
            )
            parts.append((part_number, etag))
            part_number += 1
    # Step 3: Commit all parts as a single blob
    storage_client.complete_multipart_upload(blob_key, upload_id, parts)
```
The key insight: each chunk can be retried independently if it fails. The client doesn't restart the entire upload — just the failed chunk. This is what makes large file uploads reliable over flaky networks.
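That per-chunk retry can be sketched as a small wrapper with exponential backoff; `upload_fn` stands in for whatever single-part call the SDK exposes:

```python
import time

def upload_part_with_retry(upload_fn, part_number, chunk, max_attempts=3):
    """Retry one chunk independently; already-uploaded parts are untouched."""
    for attempt in range(max_attempts):
        try:
            return upload_fn(part_number, chunk)  # returns an etag on success
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up on this part only; the upload can resume later
            time.sleep(2 ** attempt)  # back off before retrying: 1s, 2s, ...
```

Because each part carries its own part number and etag, the server can reassemble the blob no matter which parts needed retries or in what order they arrived.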
For hot content (profile pictures, popular videos), you put a CDN in front. The CDN caches blobs at edge nodes close to users, dramatically reducing latency and offloading your storage nodes.
For partial downloads (think: video seeking), support range reads via HTTP Range headers: Range: bytes=1048576-2097151. This lets a video player fetch only the section of a file it needs — you don't have to download a 2GB file to watch minute 47.
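Building that header for fixed-size chunks is simple arithmetic (end offset is inclusive); a minimal sketch, with the 1 MiB chunk size chosen to match the example above:

```python
def range_header(chunk_index: int, chunk_size: int = 1024 * 1024) -> str:
    """HTTP Range header for the Nth fixed-size chunk (0-based, end inclusive)."""
    start = chunk_index * chunk_size
    end = start + chunk_size - 1
    return f"bytes={start}-{end}"
```

`range_header(1)` produces exactly the example above, `bytes=1048576-2097151`; a server that supports ranges answers with `206 Partial Content` and just that slice.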
Don't just say "add rate limiting." Explain the layers:
A single global rate limiter is a bottleneck and a single point of failure. Layered rate limiting is the right answer.
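A common building block at each of those layers is a token bucket. Here's a minimal single-node sketch; a distributed deployment would keep these counters in a shared store such as Redis rather than in-process:

```python
import time

class TokenBucket:
    """Allow `rate` requests/sec on average, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Each frontend server runs its own buckets (per account, per container), so no single limiter sits in the hot path for every request.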
If the interviewer asks "how do you list all blobs in a bucket with a billion objects?", do NOT say offset pagination. Here's why:
to serve a page at offset N, the database must scan and discard the first N rows — that's O(N) work per page, and it gets slower the deeper you paginate.

Instead, use continuation tokens: an opaque cursor that encodes the last-seen position. The system uses this to resume from exactly the right place, regardless of what changed in between.
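A minimal sketch of cursor pagination over a sorted key list (an in-memory stand-in for the metadata store; a real implementation does an indexed seek, not a scan):

```python
import base64

def list_blobs(sorted_keys, token=None, page_size=3):
    """Return one page of keys plus an opaque continuation token
    encoding the last key returned (None when exhausted)."""
    start_after = base64.b64decode(token).decode() if token else ""
    page = [k for k in sorted_keys if k > start_after][:page_size]
    next_token = base64.b64encode(page[-1].encode()).decode() if page else None
    return page, next_token
```

Because the token names a position rather than an offset, inserts and deletes between page fetches can't shift or duplicate results — each page resumes strictly after the last key seen.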
When you draw this on the whiteboard, hit these components in order:
```
[Client]
   ↓
[CDN]  ← (cache hot blobs at edge)
   ↓
[API Gateway]  ← (auth, rate limiting, routing)
   ↓
[Load Balancer]
   ↓
[Frontend Servers]  ← (quota checks, request validation)
   ↓                       ↓
[Metadata Service]    [Data Nodes (Partitioned)]
[e.g., Cassandra,     [e.g., Shard 1, Shard 2...N]
 distributed KV]           ↓
                      [Replication Workers]
                      [Background GC / Cleanup]
```

Start simple, then layer on the CDN, background workers, and replication. Never jump to the full diagram immediately — build it incrementally. This shows structured thinking.
Let me be direct about the most common failure modes I see:
One classic example: assuming that because you have the blobId, you know where the data is. You don't — you still need the metadata service to map that ID to a physical node.

Here's the exact phrasing I coach candidates to use:
Opening the design:
"Before I start drawing, let me clarify the requirements. Are we optimizing for read-heavy or write-heavy workloads? What's the expected object size distribution? Do we need strong consistency, or is eventual consistency acceptable for reads?"
Introducing partitioning:
"To handle this at scale, I'd partition blobs across nodes using a hash of the blob key. This gives us uniform distribution and avoids the hotspot problem you'd get with sequential keys."
Explaining metadata separation:
"One important design decision is separating the control plane from the data plane. Metadata — things like blob location, size, and checksum — lives in a fast, consistent store like Cassandra or a distributed KV. The actual binary data lives on storage nodes optimized for high throughput. This separation lets us optimize each layer independently."
Handling the follow-up on failures:
"If a storage node fails, our health check service detects it within seconds. We then trigger background re-replication — any blobs that were on that node get copied to healthy nodes to restore our target replication factor of 3."
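The repair step in that answer can be sketched as a scan for under-replicated blobs; the bookkeeping structures here are illustrative (blob → replica-node map, plus the current healthy-node set):

```python
def find_under_replicated(blob_replicas, healthy_nodes, target=3):
    """Return {blob_id: copies_needed} for blobs below the replication target.
    `blob_replicas` maps blob_id -> list of node names holding a copy."""
    repairs = {}
    for blob_id, nodes in blob_replicas.items():
        live = [n for n in nodes if n in healthy_nodes]
        if len(live) < target:
            repairs[blob_id] = target - len(live)
    return repairs
```

After a node failure, every blob that had a copy there shows up needing one new replica; replication workers then copy from the surviving replicas to healthy nodes until the map comes back empty.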
The interviewer will almost certainly ask at least one of these:
These are the things that make interviewers quietly move a candidate to the "no" pile:
Here's what to burn into memory before your interview:
Nail these seven points and you'll be in the top 10% of candidates on this question. The interviewer doesn't expect you to design the next S3 from scratch — they expect you to reason clearly about scale, tradeoffs, and failure. That's exactly what this framework gives you.