At Scripbox, we're constantly working with external APIs—payment gateways, KYC services, market data providers, and various financial services. While these integrations power our platform, they also introduce a critical challenge: what happens when these external services slow down or fail?

This is where the circuit breaker pattern becomes essential. Just like electrical circuit breakers protect homes from power surges, software circuit breakers protect our applications from cascading failures when dependencies become unreliable.

The Problem: Cascading Failures

Picture this scenario: Our KYC verification service normally responds in 200ms, but suddenly starts taking 30 seconds per request due to issues on their end. Without protection, our threads get blocked waiting for these slow responses. As more requests pile up, our application server runs out of available threads, and suddenly our entire platform becomes unresponsive.

This cascading failure pattern is all too common in distributed systems. A single slow dependency can bring down an entire application, affecting users who aren't even using the problematic feature.

Enter the Circuit Breaker Pattern

The circuit breaker pattern, popularized by Martin Fowler, offers an elegant solution. The concept is simple: monitor calls to external services, and when failures reach a threshold, "open" the circuit to prevent further calls. After a cooling-off period, allow a few test calls to see if the service has recovered.

The circuit breaker has three states:

Closed: Normal operation, calls pass through

Open: Circuit is tripped, calls fail fast

Half-Open: Testing if the service has recovered
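
To make these transitions concrete, here's a small, library-agnostic sketch in Ruby. The module and event names are ours for illustration only, not part of any library:

module CircuitState
  CLOSED    = :closed     # normal operation: calls pass through, failures are counted
  OPEN      = :open       # tripped: calls fail fast until the retry timeout elapses
  HALF_OPEN = :half_open  # probing: a single trial call decides the next state

  # Which event moves the breaker from one state to the next.
  TRANSITIONS = {
    CLOSED    => { failure_threshold_exceeded: OPEN },
    OPEN      => { retry_timeout_elapsed: HALF_OPEN },
    HALF_OPEN => { trial_call_succeeded: CLOSED, trial_call_failed: OPEN }
  }.freeze
end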

Why Build Our Own?

While several circuit breaker libraries exist for Ruby, we need something that fits our specific requirements:

Simplicity: We want a lightweight library that doesn't add unnecessary complexity to our codebase.

Flexibility: Different services have different characteristics. Our payment gateway might need different thresholds than our market data API.

Fine-grained Control: We need configurable failure thresholds, timeout values, and retry intervals that match our specific use cases.

Transparency: Clear state transitions and behavior that we can easily debug and monitor.

Designing Our Circuit Breaker

Our implementation centers on a CircuitBreaker::Shield class that wraps potentially unreliable operations. The design focuses on four key configuration parameters:

Invocation Timeout: How long to wait for an operation before considering it failed. For our KYC service, we set this to 5 seconds.

Failure Threshold: The absolute number of failures that will trip the circuit. We typically set this to 5 for critical services.

Failure Threshold Percentage: The percentage of failed calls that will trip the circuit. This prevents the circuit from opening due to a few failures during high-volume periods.

Retry Timeout: How long to wait before testing if a failed service has recovered. We usually start with 30 seconds and adjust based on the service's typical recovery time.

Implementation Details

The core of our circuit breaker is elegant in its simplicity:

circuit_breaker = CircuitBreaker::Shield.new(
  invocation_timeout: 5,              # seconds to wait for a call before counting it as a failure
  failure_threshold: 5,               # absolute number of failures that trips the circuit
  failure_threshold_percentage: 0.3,  # fraction of failed calls (here 30%) that trips the circuit
  retry_timeout: 30                   # seconds to wait before allowing a trial call
)

result = circuit_breaker.protect do
  KycService.verify_user(user_id)
end

When the circuit is closed, operations execute normally. The circuit breaker tracks success and failure rates, watching for patterns that indicate the service is struggling.

Once failure thresholds are exceeded, the circuit opens. Subsequent calls fail fast without even attempting to contact the service. This immediate failure prevents our threads from being tied up waiting for responses that likely won't come.

After the retry timeout period, the circuit moves to half-open state. A single test call is allowed through. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again for another retry period.
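
The real library also folds in the invocation timeout and the percentage threshold, but the state handling itself fits in a few lines. The class below is a simplified, self-contained sketch of the behaviour just described, not the library's actual internals:

class SimpleBreaker
  OpenError = Class.new(StandardError)

  def initialize(failure_threshold:, retry_timeout:)
    @failure_threshold = failure_threshold
    @retry_timeout     = retry_timeout
    @failure_count     = 0
    @opened_at         = nil
  end

  def protect
    # Open and still inside the retry window: fail fast without touching the service.
    raise OpenError, "circuit open" if open? && !retry_timeout_elapsed?

    begin
      result = yield   # closed, or a half-open trial call
      reset            # any success closes the circuit again
      result
    rescue StandardError
      record_failure   # failures count toward the threshold (and re-open a half-open circuit)
      raise
    end
  end

  private

  def open?
    !@opened_at.nil?
  end

  def retry_timeout_elapsed?
    Time.now - @opened_at >= @retry_timeout
  end

  def reset
    @failure_count = 0
    @opened_at = nil
  end

  def record_failure
    @failure_count += 1
    @opened_at = Time.now if @failure_count >= @failure_threshold
  end
end

One detail worth noting in the sketch: a fast-failed call never reaches record_failure, so the retry window isn't pushed back by the very calls the breaker is rejecting.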

Real-World Usage Patterns

We're using circuit breakers throughout our platform, each configured for its specific use case:

Payment Processing: Short timeout (2 seconds) with a low failure threshold. When payment gateways are slow, we need to fail fast to avoid user frustration.

Market Data: Longer timeout (10 seconds) with higher failure threshold. Market data can tolerate some delays, and we want to avoid opening the circuit for temporary slowdowns.

KYC Services: Medium timeout (5 seconds) with percentage-based thresholds. KYC typically has predictable response times, so we focus on percentage of failures rather than absolute numbers.
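
In code, this amounts to one shield per dependency, each tuned independently. The numbers and constant names below are illustrative rather than our exact production values:

# One shield per external dependency, tuned to its characteristics.
PAYMENT_SHIELD = CircuitBreaker::Shield.new(
  invocation_timeout: 2,             # fail fast: payments should never keep users waiting
  failure_threshold: 3,              # low absolute threshold for a critical path
  failure_threshold_percentage: 0.1,
  retry_timeout: 30
)

MARKET_DATA_SHIELD = CircuitBreaker::Shield.new(
  invocation_timeout: 10,            # market data tolerates slower responses
  failure_threshold: 10,             # ride out temporary slowdowns
  failure_threshold_percentage: 0.5,
  retry_timeout: 60
)

KYC_SHIELD = CircuitBreaker::Shield.new(
  invocation_timeout: 5,
  failure_threshold: 5,
  failure_threshold_percentage: 0.3, # percentage-based: traffic volume is predictable
  retry_timeout: 30
)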

Monitoring and Observability

Circuit breakers are only as good as your ability to monitor them. We've integrated logging at key transition points:

• When circuits open (indicating service degradation)

• When circuits close (indicating service recovery)

• Failure rate patterns that approach thresholds
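
One lightweight way to get this visibility is to log around each protected call and watch latencies and failure rates in our log pipeline. The wrapper below is an illustrative sketch; the helper name and log format are ours, not part of the library:

require 'logger'

LOGGER = Logger.new($stdout)

# Illustrative wrapper: times each protected call and logs the outcome so
# that failure-rate patterns approaching the thresholds show up in our logs.
def with_circuit_logging(service_name, shield)
  started = Time.now
  result = shield.protect { yield }
  LOGGER.info("#{service_name}: call succeeded in #{((Time.now - started) * 1000).round}ms")
  result
rescue StandardError => e
  LOGGER.warn("#{service_name}: call failed after #{((Time.now - started) * 1000).round}ms (#{e.class}: #{e.message})")
  raise
end

# Usage: with_circuit_logging("kyc", KYC_SHIELD) { KycService.verify_user(user_id) }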

This visibility helps us identify service issues before they affect users and fine-tune our thresholds based on real-world behavior.

Handling Circuit Breaker States

Different circuit states require different application responses:

Circuit Closed: Business as usual. Operations proceed normally with full functionality.

Circuit Open: Graceful degradation. We show cached data, disable non-critical features, or provide alternative user flows.

Circuit Half-Open: Cautious operation. We might show loading states or temporary messages while testing service recovery.
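
As an example of graceful degradation, a read path protected by the market data shield can fall back to cached values when the circuit is open. The sketch below rescues broadly because the exact error raised for an open circuit depends on the library version; Holdings and CachedHoldings are placeholder names:

def holdings_for(user_id)
  MARKET_DATA_SHIELD.protect { Holdings.fetch_live(user_id) }
rescue StandardError
  # Circuit open, timed out, or the call itself failed: serve the last
  # known snapshot instead of surfacing an error to the user.
  CachedHoldings.fetch(user_id)
end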

Performance Impact

One concern with circuit breakers is the performance overhead of tracking calls and managing state. Our implementation is lightweight—the overhead is negligible compared to the network calls we're protecting.

More importantly, when circuits are open, we're saving significant resources by not making calls that would likely time out anyway. The performance benefit during service degradation far outweighs any minimal overhead during normal operation.

Lessons Learned

Building and deploying our circuit breaker library is teaching us valuable lessons about resilient system design:

Thresholds Matter: Setting the right failure thresholds requires understanding your services' normal behavior patterns. Too sensitive, and you get false positives. Too lenient, and you don't get protection when you need it.

Recovery Time Varies: Different services have different recovery characteristics. Database issues might resolve in seconds, while third-party API issues could take minutes.

User Experience: Circuit breakers are most effective when paired with good fallback strategies. Failing fast is only valuable if you have something meaningful to show users.

Testing is Crucial: We're learning to test circuit breaker behavior under various failure scenarios. It's not enough to test normal operation—you need to verify behavior when things go wrong.
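
For example, a spec can verify the fail-fast behaviour without reaching into library internals: trip the circuit with failing calls, then assert that the protected block is no longer invoked. This is a sketch in RSpec syntax; it assumes only the behaviour described above (an open circuit stops calling the block) and rescues broadly since the exact errors raised depend on the library:

RSpec.describe CircuitBreaker::Shield do
  it "stops invoking the protected block once the circuit opens" do
    shield = described_class.new(
      invocation_timeout: 1,
      failure_threshold: 3,
      failure_threshold_percentage: 0.3,
      retry_timeout: 60
    )

    invocations = 0
    failing_call = -> { invocations += 1; raise "downstream error" }

    # Trip the circuit by exceeding the failure threshold.
    5.times do
      begin
        shield.protect { failing_call.call }
      rescue StandardError
        # ignore failures while tripping the breaker
      end
    end
    tripped_invocations = invocations

    # With the circuit open, further calls should fail fast without
    # running the block at all.
    begin
      shield.protect { failing_call.call }
    rescue StandardError
    end

    expect(invocations).to eq(tripped_invocations)
  end
end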

Looking Forward

Our circuit breaker library is proving its value in production. We're seeing faster recovery from service issues and better overall system stability.

Future enhancements we're considering include:

• Integration with our monitoring systems for automatic alerting

• More sophisticated failure detection beyond simple timeouts

• Metrics collection for better threshold tuning

• Support for bulkhead patterns to isolate different types of failures

Building Resilient Systems

Circuit breakers are just one part of building resilient distributed systems, but they're a crucial part. By failing fast when dependencies are unreliable, we protect our core platform functionality and maintain a better user experience even when external services struggle.

The key insight is that in distributed systems, failures are inevitable. The goal isn't to eliminate failures—it's to contain them and recover gracefully. Our circuit breaker library is helping us do exactly that.

Check out the complete implementation: circuit_breaker-ruby on GitHub