Lesson 2: Interview Question - Design a High-Performance Payment System

Answer SLO and performance questions with confidence.

Lesson 2: Interview Question - Design a High-Performance Payment System

The Interview Question

“Design a payment processing system that can handle 1 million transactions per second with 99.99% availability and < 100ms latency.”

This question tests your understanding of:

Performance requirements (SLOs)
High availability
Low latency systems
Trade-offs between consistency and performance

Step 1: Clarify Requirements

You should ask:

“What’s the transaction volume? Peak vs average?”
“What’s the availability requirement? 99.9% or 99.99%?”
“What’s the latency requirement? P95 or P99?”
“What about consistency? Do we need strong consistency?”

Interviewer’s answer:

“1M transactions/second at peak”
“99.99% availability (four nines)”
”< 100ms p95 latency”
“Strong consistency required (it’s money!)”

Step 2: Design with SLOs in Mind

This is where SLOs (Service Level Objectives) come in. Interviewers love when you think about measurable targets.

Let’s model the payment system with explicit SLOs:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  technology "Go, gRPC"

  // This shows production-ready thinking!
  slo {
    availability {
      target "99.99%"
      window "30 days"
      current "99.97%"
    }

    latency {
      p95 "100ms"
      p99 "250ms"
      window "7 days"
      current {
        p95 "85ms"
        p99 "200ms"
      }
    }

    errorRate {
      target "< 0.01%"
      window "30 days"
      current "0.008%"
    }

    throughput {
      target "1000000 txn/s"
      window "1 hour"
      current "950000 txn/s"
    }
  }

  scale {
    min 100
    max 10000
    metric "cpu > 70% or requests_per_second > 500000"
  }
}

FraudDetection = container "Fraud Detection" {
  technology "Python, ML"
  description "Real-time fraud detection"
}

PaymentDB = database "Payment Database" {
  technology "PostgreSQL"
  description "Primary database with 10 read replicas"
}

Cache = database "Payment Cache" {
  technology "Redis"
  description "Caches recent transactions"
}

PaymentQueue = queue "Payment Queue" {
  technology "Kafka"
  description "Async payment processing"
}
}

Stripe = system "Stripe Gateway" {
tags ["external"]
}

BankAPI = system "Bank API" {
tags ["external"]
}

PaymentService.PaymentAPI -> PaymentService.FraudDetection "Validates"
PaymentService.PaymentAPI -> PaymentService.Cache "Checks recent transactions"
PaymentService.PaymentAPI -> PaymentService.PaymentDB "Stores transaction"
PaymentService.PaymentAPI -> PaymentService.PaymentQueue "Enqueues for async processing"
PaymentService.PaymentAPI -> Stripe "Processes payment"
PaymentService.PaymentAPI -> BankAPI "Validates with bank"

view index {
include *
}

What Interviewers Look For

✅ Good Answer (What You Just Did)

Defined SLOs explicitly - Shows you think about measurable targets
Addressed all requirements - Availability, latency, throughput
Explained trade-offs - Strong consistency vs performance
Scalability - Showed how to handle 1M txn/s
Redundancy - Multiple replicas, failover strategies

❌ Bad Answer (Common Mistakes)

Not defining SLOs or performance targets
Ignoring availability requirements
Not explaining how to achieve 99.99% availability
Not addressing consistency requirements
No capacity estimation

Key Points to Mention in Interview

1. Availability (99.99% = Four Nines)

Say: “99.99% availability means 52.6 minutes of downtime per year. To achieve this, we need:

Multiple data centers (active-active)
Automatic failover
Health checks and monitoring
Database replication with automatic promotion”

2. Latency (< 100ms p95)

Say: “To achieve < 100ms latency, we:

Use in-memory cache (Redis) for hot data
Keep database queries simple and indexed
Use connection pooling
Minimize network hops
Consider async processing for non-critical paths”

3. Throughput (1M txn/s)

Say: “To handle 1M transactions/second:

Horizontal scaling: 100-10,000 API instances
Database sharding by transaction ID
Read replicas for scaling reads
Caching frequently accessed data
Async processing for non-critical operations”

4. Strong Consistency

Say: “Since this is financial data, we need strong consistency:

All writes go to primary database
Read replicas are eventually consistent (ok for reads)
Use distributed transactions for critical operations
Trade-off: Slightly higher latency for correctness”

Understanding SLO Types (Interview Context)

Availability SLO

Interviewer asks: “How do you ensure 99.99% availability?”

Your answer with SLO:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  slo {
    availability {
      target "99.99%"
      window "30 days"
      current "99.97%"
    }
  }
}
}

view index {
include *
}

Explain: “We target 99.99% (four nines), which allows 52.6 minutes downtime per year. Currently at 99.97%, so we’re close but need to improve redundancy.”

Latency SLO

Interviewer asks: “How fast should payments process?”

Your answer with SLO:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  slo {
    latency {
      p95 "100ms"
      p99 "250ms"
      window "7 days"
    }
  }
}
}

view index {
include *
}

Explain: “95% of payments complete in under 100ms, 99% in under 250ms. We use p95/p99 instead of average because they show real user experience - a few slow payments don’t skew the metric.”

Error Rate SLO

Interviewer asks: “What error rate is acceptable?”

Your answer with SLO:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  slo {
    errorRate {
      target "< 0.01%"
      window "30 days"
      current "0.008%"
    }
  }
}
}

view index {
include *
}

Explain: “We target less than 0.01% error rate. Currently at 0.008%, which is good, but we monitor closely because payment errors are critical.”

Real Interview Example: Capacity Estimation

Interviewer: “How many servers do you need for 1M txn/s?”

Your answer:

“Each transaction requires ~10ms processing = 100 transactions/second per server”
“1M txn/s ÷ 100 = 10,000 servers needed”
“With 2x headroom for spikes and redundancy: ~20,000 servers”
“But we can optimize:
- Caching reduces DB load → fewer DB servers
- Async processing → can batch operations
- Database sharding → distributes load
- Final estimate: ~5,000-10,000 servers”

Interview Practice: Add High Availability

Interviewer: “How do you ensure 99.99% availability?”

Add redundancy to your design:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  technology "Go, gRPC"
  scale {
    min 100
    max 10000
    metric "cpu > 70%"
  }
  description "Deployed across 3 data centers (active-active)"
}

PaymentDB = database "Payment Database" {
  technology "PostgreSQL"
  description "Primary in US-East, replicas in US-West and EU"
}
}

// Show redundancy
PaymentService.PaymentAPI -> PaymentService.PaymentDB "Writes to primary"

view index {
include *
}

Explain: “We deploy across 3 data centers in active-active mode. If one fails, traffic automatically routes to others. Database has primary + replicas with automatic failover.”

Common Follow-Up Questions

Be prepared for:

“What if the database fails?”
- Answer: “Automatic failover to replica, data replication with < 1s lag”
“How do you handle network partitions?”
- Answer: “CAP theorem - we choose consistency over availability for payments. If partition occurs, we reject transactions rather than risk inconsistency.”
“What about data consistency across regions?”
- Answer: “Synchronous replication for critical data, eventual consistency for non-critical. Use distributed transactions for cross-region operations.”
“How do you monitor SLOs?”
- Answer: “Real-time dashboards showing current vs target SLOs. Alerts when we’re at risk of violating SLOs. Weekly reviews of SLO performance.”

Exercise: Practice This Question

Design a payment system and be ready to explain:

How you achieve 99.99% availability
How you keep latency < 100ms
How you handle 1M txn/s
Your SLO targets and how you measure them

Practice tip: Time yourself (40-45 minutes) and explain out loud. Focus on SLOs - interviewers love this!

Key Takeaways for Interviews

Always define SLOs - Shows production-ready thinking
Explain trade-offs - Availability vs consistency, latency vs throughput
Show capacity estimation - Back up your numbers
Mention monitoring - How you track SLOs
Discuss failure scenarios - What happens when things break

Next Steps

You’ve learned how to handle performance and SLO questions. In the next module, we’ll tackle modular architecture questions - another common interview topic!