Lesson 3: Availability & Reliability

Redundancy, Failover, and SLAs.

Lesson 3: Availability & Reliability

Reliability vs. Availability

  • Reliability: The probability that a system will function correctly without failure for a specified period. It’s about correctness.
  • Availability: The percentage of time a system is operational and accessible. It’s about uptime.

A system can be available but not reliable (e.g., it returns 500 errors but is “up”).

Measuring Availability

Availability is often measured in “nines”:

AvailabilityDowntime per Year
99% (Two nines)3.65 days
99.9% (Three nines)8.76 hours
99.99% (Four nines)52.6 minutes
99.999% (Five nines)5.26 minutes

Achieving High Availability

Redundancy

The key to availability is eliminating Single Points of Failure (SPOF). This is done via redundancy.

  • Active-Passive: One server handles traffic; the other is on standby.
  • Active-Active: Both servers handle traffic. If one fails, the other takes over the full load.

Failover

The process of switching to a redundant system upon failure. This can be manual or automatic.


🛠️ Sruja Perspective: Modeling Redundancy

You can explicitly model redundant components in Sruja to visualize your high-availability strategy.

import { * } from 'sruja.ai/stdlib'


Payments = system "Payment System" {
    PaymentService = container "Payment Service" {
        technology "Java"
    }

    // Modeling a primary and standby database
    PrimaryDB = database "Primary Database" {
        technology "MySQL"
        tags ["primary"]
    }

    StandbyDB = database "Standby Database" {
        technology "MySQL"
        tags ["standby"]
        description "Replicates from PrimaryDB. Promoted to primary if PrimaryDB fails."
    }

    PaymentService -> PrimaryDB "Reads/Writes"
    PrimaryDB -> StandbyDB "Replicates data"
}

view index {
include *
}