Lesson 2: Interview Question - Design a High-Performance Payment System
Answer SLO and performance questions with confidence.
The Interview Question
“Design a payment processing system that can handle 1 million transactions per second with 99.99% availability and < 100ms latency.”
This question tests your understanding of:
- Performance requirements (SLOs)
- High availability
- Low latency systems
- Trade-offs between consistency and performance
Step 1: Clarify Requirements
You should ask:
- “What’s the transaction volume? Peak vs average?”
- “What’s the availability requirement? 99.9% or 99.99%?”
- “What’s the latency requirement? P95 or P99?”
- “What about consistency? Do we need strong consistency?”
Interviewer’s answer:
- “1M transactions/second at peak”
- “99.99% availability (four nines)”
- “< 100ms p95 latency”
- “Strong consistency required (it’s money!)”
Step 2: Design with SLOs in Mind
This is where SLOs (Service Level Objectives) come in. Interviewers love it when you think in measurable targets.
Let’s model the payment system with explicit SLOs:
```
import { * } from 'sruja.ai/stdlib'

PaymentService = system "Payment Processing" {
  PaymentAPI = container "Payment API" {
    technology "Go, gRPC"

    // This shows production-ready thinking!
    slo {
      availability {
        target "99.99%"
        window "30 days"
        current "99.97%"
      }
      latency {
        p95 "100ms"
        p99 "250ms"
        window "7 days"
        current {
          p95 "85ms"
          p99 "200ms"
        }
      }
      errorRate {
        target "< 0.01%"
        window "30 days"
        current "0.008%"
      }
      throughput {
        target "1000000 txn/s"
        window "1 hour"
        current "950000 txn/s"
      }
    }

    scale {
      min 100
      max 10000
      metric "cpu > 70% or requests_per_second > 500000"
    }
  }

  FraudDetection = container "Fraud Detection" {
    technology "Python, ML"
    description "Real-time fraud detection"
  }

  PaymentDB = database "Payment Database" {
    technology "PostgreSQL"
    description "Primary database with 10 read replicas"
  }

  Cache = database "Payment Cache" {
    technology "Redis"
    description "Caches recent transactions"
  }

  PaymentQueue = queue "Payment Queue" {
    technology "Kafka"
    description "Async payment processing"
  }
}

Stripe = system "Stripe Gateway" {
  tags ["external"]
}

BankAPI = system "Bank API" {
  tags ["external"]
}

PaymentService.PaymentAPI -> PaymentService.FraudDetection "Validates"
PaymentService.PaymentAPI -> PaymentService.Cache "Checks recent transactions"
PaymentService.PaymentAPI -> PaymentService.PaymentDB "Stores transaction"
PaymentService.PaymentAPI -> PaymentService.PaymentQueue "Enqueues for async processing"
PaymentService.PaymentAPI -> Stripe "Processes payment"
PaymentService.PaymentAPI -> BankAPI "Validates with bank"

view index {
  include *
}
```
What Interviewers Look For
✅ Good Answer (What You Just Did)
- Defined SLOs explicitly - Shows you think about measurable targets
- Addressed all requirements - Availability, latency, throughput
- Explained trade-offs - Strong consistency vs performance
- Scalability - Showed how to handle 1M txn/s
- Redundancy - Multiple replicas, failover strategies
❌ Bad Answer (Common Mistakes)
- Not defining SLOs or performance targets
- Ignoring availability requirements
- Not explaining how to achieve 99.99% availability
- Not addressing consistency requirements
- No capacity estimation
Key Points to Mention in Interview
1. Availability (99.99% = Four Nines)
Say: “99.99% availability means 52.6 minutes of downtime per year. To achieve this, we need:
- Multiple data centers (active-active)
- Automatic failover
- Health checks and monitoring
- Database replication with automatic promotion”
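The arithmetic behind “four nines = 52.6 minutes per year” is worth being able to reproduce on the spot. A quick sketch in plain Python, using 365.25 days per year:

```python
# Downtime budget implied by an availability target.
def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction."""
    minutes_per_year = 365.25 * 24 * 60
    return (1 - availability) * minutes_per_year

for label, target in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(target):.1f} minutes/year")
```

Each extra nine cuts the budget tenfold, which is why 99.99% forces automatic (not manual) failover: a single human-paced incident can spend the entire year's budget.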
2. Latency (< 100ms p95)
Say: “To achieve < 100ms latency, we:
- Use in-memory cache (Redis) for hot data
- Keep database queries simple and indexed
- Use connection pooling
- Minimize network hops
- Consider async processing for non-critical paths”
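The cache-first read path above is the classic cache-aside pattern. A minimal sketch, with a plain dict standing in for Redis and `fetch_from_db` as a hypothetical placeholder for an indexed PostgreSQL lookup:

```python
# Cache-aside lookup: check the cache first, fall back to the database
# on a miss, then populate the cache for subsequent reads.
cache: dict[str, dict] = {}  # stand-in for Redis

def fetch_from_db(txn_id: str) -> dict:
    # Hypothetical stand-in for an indexed primary-key query.
    return {"id": txn_id, "status": "completed"}

def get_transaction(txn_id: str) -> dict:
    if txn_id in cache:             # hot path: no network round trip to the DB
        return cache[txn_id]
    record = fetch_from_db(txn_id)  # slow path: one indexed DB query
    cache[txn_id] = record          # repeat reads now stay on the hot path
    return record
```

In the real system the dict would be Redis with a TTL, but the latency argument is the same: hot reads never pay the database round trip.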
3. Throughput (1M txn/s)
Say: “To handle 1M transactions/second:
- Horizontal scaling: 100-10,000 API instances
- Database sharding by transaction ID
- Read replicas for scaling reads
- Caching frequently accessed data
- Async processing for non-critical operations”
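Sharding by transaction ID can be sketched with a stable hash. The shard count and function name here are illustrative assumptions, not a prescribed scheme:

```python
import hashlib

NUM_SHARDS = 64  # assumed number of logical shards for illustration

def shard_for(txn_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a transaction ID to a shard via a stable hash.

    SHA-256 is used instead of Python's built-in hash(), which is
    salted per process and would route inconsistently across servers.
    """
    digest = hashlib.sha256(txn_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because routing is deterministic, every API instance sends a given transaction to the same shard, so per-transaction operations never need cross-shard coordination.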
4. Strong Consistency
Say: “Since this is financial data, we need strong consistency:
- All writes go to primary database
- Read replicas are eventually consistent (ok for reads)
- Use distributed transactions for critical operations
- Trade-off: Slightly higher latency for correctness”
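The write-to-primary / read-from-replica split amounts to a routing rule. A minimal sketch, with plain strings standing in for real connection handles:

```python
# Primary/replica routing: all writes go to the primary; reads may hit an
# eventually consistent replica unless the caller requires freshness.
import random

PRIMARY = "primary"
REPLICAS = ["replica-1", "replica-2", "replica-3"]

def route(operation: str, require_fresh: bool = False) -> str:
    if operation == "write" or require_fresh:
        return PRIMARY              # strong consistency: single writer
    return random.choice(REPLICAS)  # reads scale out, may lag slightly
```

The `require_fresh` flag captures the trade-off in code: a balance check right after a payment pays the primary's latency; a transaction-history page does not have to.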
Understanding SLO Types (Interview Context)
Availability SLO
Interviewer asks: “How do you ensure 99.99% availability?”
Your answer with SLO:
```
import { * } from 'sruja.ai/stdlib'

PaymentService = system "Payment Processing" {
  PaymentAPI = container "Payment API" {
    slo {
      availability {
        target "99.99%"
        window "30 days"
        current "99.97%"
      }
    }
  }
}

view index {
  include *
}
```
Explain: “We target 99.99% (four nines), which allows 52.6 minutes downtime per year. Currently at 99.97%, so we’re close but need to improve redundancy.”
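A sharp way to quantify that gap is the error budget: allowed unavailability is 1 − target, and observed unavailability is measured against it. A minimal sketch:

```python
def error_budget_used(target: float, current: float) -> float:
    """Observed unavailability as a multiple of the allowed budget.

    1.0 means the budget is exactly spent; above 1.0 the SLO is missed.
    """
    return (1 - current) / (1 - target)
```

At a 99.99% target with 99.97% measured, the budget is consumed three times over, which is why the redundancy work is not optional even though the raw percentages look close.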
Latency SLO
Interviewer asks: “How fast should payments process?”
Your answer with SLO:
```
import { * } from 'sruja.ai/stdlib'

PaymentService = system "Payment Processing" {
  PaymentAPI = container "Payment API" {
    slo {
      latency {
        p95 "100ms"
        p99 "250ms"
        window "7 days"
      }
    }
  }
}

view index {
  include *
}
```
Explain: “95% of payments complete in under 100ms, 99% in under 250ms. We use p95/p99 instead of average because they show real user experience - a few slow payments don’t skew the metric.”
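The “percentiles over averages” point can be demonstrated with a tiny sample. The nearest-rank method below is one common way to compute pXX:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# 19 normal payments and one pathological outlier (milliseconds):
latencies_ms = [50] * 10 + [60] * 5 + [70] * 4 + [5000]
mean_ms = sum(latencies_ms) / len(latencies_ms)  # 304.0, badly skewed
p95_ms = percentile(latencies_ms, 95)            # 70, the typical experience
p99_ms = percentile(latencies_ms, 99)            # 5000, catches the outlier
```

One slow payment drags the mean to 304ms while p95 stays at 70ms, and p99 still surfaces the outlier, which is exactly the behavior the SLO wants.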
Error Rate SLO
Interviewer asks: “What error rate is acceptable?”
Your answer with SLO:
```
import { * } from 'sruja.ai/stdlib'

PaymentService = system "Payment Processing" {
  PaymentAPI = container "Payment API" {
    slo {
      errorRate {
        target "< 0.01%"
        window "30 days"
        current "0.008%"
      }
    }
  }
}

view index {
  include *
}
```
Explain: “We target less than 0.01% error rate. Currently at 0.008%, which is good, but we monitor closely because payment errors are critical.”
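Percentages hide scale; translating the error-rate SLO into absolute failures per second is plain arithmetic, using only the numbers above:

```python
def failed_per_second(tps: int, error_rate_pct: float) -> float:
    """Absolute failed transactions per second for a given error rate."""
    return tps * error_rate_pct / 100

at_target = failed_per_second(1_000_000, 0.01)    # ~100 failed payments/s
at_current = failed_per_second(1_000_000, 0.008)  # ~80 failed payments/s
```

Even while meeting the SLO, roughly a hundred payments fail every second at peak, so each failure still needs retry and reconciliation handling.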
Real Interview Example: Capacity Estimation
Interviewer: “How many servers do you need for 1M txn/s?”
Your answer:
- “Each transaction takes ~10ms of processing, so a single worker handles ~100 transactions/second.”
- “1M txn/s ÷ 100 txn/s per server = 10,000 servers needed.”
- “With 2x headroom for spikes and redundancy: ~20,000 servers.”
- “But we can optimize:
  - Caching reduces DB load → fewer DB servers
  - Async processing → can batch operations
  - Database sharding → distributes load”
- “Final estimate: ~5,000-10,000 servers.”
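The estimate is easy to reproduce as back-of-envelope arithmetic. The per-server throughput figures below are the assumptions from the answer, not measured numbers:

```python
import math

def servers_needed(target_tps: int, per_server_tps: int, headroom: float = 2.0) -> int:
    """Server count for a target throughput, with a headroom multiplier."""
    return math.ceil(target_tps / per_server_tps * headroom)

base = servers_needed(1_000_000, 100, headroom=1.0)  # 10000: raw requirement
padded = servers_needed(1_000_000, 100)              # 20000: with 2x headroom
# If caching and batching lift a server to ~400 txn/s, the padded
# estimate drops into the ~5,000-10,000 range quoted above:
optimized = servers_needed(1_000_000, 400)           # 5000
```

Interviewers care less about the final number than about seeing the per-server assumption stated explicitly, since that is the lever the optimizations move.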
Interview Practice: Add High Availability
Interviewer: “How do you ensure 99.99% availability?”
Add redundancy to your design:
```
import { * } from 'sruja.ai/stdlib'

PaymentService = system "Payment Processing" {
  PaymentAPI = container "Payment API" {
    technology "Go, gRPC"
    scale {
      min 100
      max 10000
      metric "cpu > 70%"
    }
    description "Deployed across 3 data centers (active-active)"
  }

  PaymentDB = database "Payment Database" {
    technology "PostgreSQL"
    description "Primary in US-East, replicas in US-West and EU"
  }
}

// Show redundancy
PaymentService.PaymentAPI -> PaymentService.PaymentDB "Writes to primary"

view index {
  include *
}
```
Explain: “We deploy across 3 data centers in active-active mode. If one fails, traffic automatically routes to others. Database has primary + replicas with automatic failover.”
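Health-based failover across the three data centers reduces to a routing function. A minimal sketch; the data-center names and boolean health flags are stand-ins for real health checks:

```python
# Active-active routing: traffic goes to any healthy data center, and an
# unhealthy one is skipped automatically with no operator action.
datacenters = {"us-east": True, "us-west": True, "eu-west": True}

def pick_datacenter(preferred: str) -> str:
    if datacenters.get(preferred):
        return preferred            # normal case: nearest/preferred DC
    for name, healthy in datacenters.items():
        if healthy:
            return name             # automatic failover to another healthy DC
    raise RuntimeError("no healthy data center available")
```

In production this logic lives in the load balancer or DNS layer rather than application code; the interview point is simply that rerouting requires no human in the loop.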
Common Follow-Up Questions
Be prepared for:
- “What if the database fails?”
  - Answer: “Automatic failover to a replica; replication keeps lag under 1 second.”
- “How do you handle network partitions?”
  - Answer: “CAP theorem: we choose consistency over availability for payments. If a partition occurs, we reject transactions rather than risk inconsistency.”
- “What about data consistency across regions?”
  - Answer: “Synchronous replication for critical data, eventual consistency for non-critical. Use distributed transactions for cross-region operations.”
- “How do you monitor SLOs?”
  - Answer: “Real-time dashboards showing current vs target SLOs, alerts when we’re at risk of violating an SLO, and weekly reviews of SLO performance.”
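SLO alerting is usually framed as error-budget burn rate. The paging threshold below is an assumption borrowed from common SRE practice, not a fixed rule:

```python
def burn_rate(observed_error_rate: float, budget_error_rate: float) -> float:
    """How fast the error budget is being spent.

    1.0 spends the budget exactly over the SLO window; higher values
    exhaust it proportionally faster.
    """
    return observed_error_rate / budget_error_rate

# Budget from the 0.01% target is 0.0001. A spike to 0.15% errors:
rate = burn_rate(0.0015, 0.0001)  # ~15x: budget gone in ~2 days of a 30-day window
fast_burn_alert = rate > 14       # assumed paging threshold for fast burns
```

Alerting on burn rate rather than raw error rate means a brief blip that barely touches the budget stays quiet, while a sustained fast burn pages immediately.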
Exercise: Practice This Question
Design a payment system and be ready to explain:
- How you achieve 99.99% availability
- How you keep latency < 100ms
- How you handle 1M txn/s
- Your SLO targets and how you measure them
Practice tip: Time yourself (40-45 minutes) and explain out loud. Focus on SLOs - interviewers love this!
Key Takeaways for Interviews
- Always define SLOs - Shows production-ready thinking
- Explain trade-offs - Availability vs consistency, latency vs throughput
- Show capacity estimation - Back up your numbers
- Mention monitoring - How you track SLOs
- Discuss failure scenarios - What happens when things break
Next Steps
You’ve learned how to handle performance and SLO questions. In the next module, we’ll tackle modular architecture questions - another common interview topic!