Building Resilient Systems: Circuit Breakers & Retries | Ugur Kaval

It is 3:14 AM on a Tuesday. Your monitoring dashboard is a sea of red, yet the CPU on your core services is idling at 5%. You have just fallen victim to a cascading failure where your 'resilient' retry logic turned a minor network blip into a self-inflicted Distributed Denial of Service (DDoS) attack. I have seen this exact scenario play out at three different scale-ups, and it is almost always the result of well-intentioned but naive error handling.

In 2026, we are no longer just 'servers talking to servers.' We are orchestrating a complex dance between local microservices, global edge functions, and third-party LLM providers. When an LLM provider's latency spikes from 200ms to 12s, your standard 30s timeout is a death sentence for your connection pool. If you are not using circuit breakers and intelligent retries, you are not building a distributed system; you are building a house of cards waiting for the first breeze.

The Fallacy of the Simple Retry

Most engineers start with a simple loop: if it fails, try again three times. This is dangerous. If a downstream service is struggling because it is overloaded, every failed call now turns into as many as four requests, which is the fastest way to ensure it never recovers. This is known as the 'amplification effect.'

To do retries correctly in 2026, you must implement three things: Exponential Backoff, Jitter, and a Retry Budget.

Exponential Backoff and Jitter

Exponential backoff increases the wait time between retries, giving the downstream service breathing room. However, if 1,000 instances of your service all start retrying at the exact same 1s, 2s, and 4s intervals, you get 'thundering herd' spikes. You need Jitter—randomized noise added to the delay.

Here is how I implement this in Go 1.26, using a pattern that has survived massive traffic spikes in production:

package resilience

import (
	"context"
	"math/rand/v2"
	"time"
)

func ExecuteWithRetry(ctx context.Context, operation func() error) error {
	const (
		maxRetries = 3
		baseDelay  = 100 * time.Millisecond
		maxDelay   = 2 * time.Second
	)

	for i := 0; i <= maxRetries; i++ {
		err := operation()
		if err == nil {
			return nil
		}

		if i == maxRetries {
			return err
		}

		// Calculate exponential backoff: base * 2^i, capped at maxDelay.
		// Capping before the jitter keeps the delay randomized even at the cap.
		backoff := baseDelay << uint(i)
		if backoff > maxDelay {
			backoff = maxDelay
		}

		// Apply Full Jitter: random between 0 and backoff
		// This is more effective than Equal Jitter for breaking up clusters
		sleepTime := time.Duration(rand.Float64() * float64(backoff))

		select {
		case <-time.After(sleepTime):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}


> **Pro Tip:** Never retry on 4xx errors (except 429 Too Many Requests). Retrying a 400 Bad Request is just wasting CPU cycles to get the same error back.
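A minimal sketch of that rule as a status classifier. The function name `ShouldRetryStatus` is hypothetical, and treating every 5xx as transient is an assumption; some teams exclude codes such as 501.

```go
package main

import "net/http"

// ShouldRetryStatus reports whether an HTTP status code is worth retrying.
// Assumption for this sketch: every 5xx is treated as transient.
func ShouldRetryStatus(code int) bool {
	switch {
	case code == http.StatusTooManyRequests:
		return true // 429: the server asked us to slow down, so back off and retry
	case code >= 400 && code < 500:
		return false // other 4xx: the request itself is wrong, retrying cannot help
	case code >= 500:
		return true // 5xx: the server failed, possibly temporarily
	default:
		return false // 1xx-3xx are not failures
	}
}
```

A retry loop would consult this before sleeping, so a 400 returns to the caller immediately instead of burning the remaining attempts.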

The Circuit Breaker: Your System's Emergency Brake

While retries protect the client, circuit breakers protect the entire ecosystem. A circuit breaker tracks failures over a sliding window. If the failure rate exceeds a threshold (e.g., 50% of requests in 10 seconds), the circuit 'trips' (Opens). For a set period, all calls to that service fail fast immediately without even hitting the network.

This prevents the 'latency tail' from consuming all your worker threads. If you have 500 threads waiting for a 30s timeout from a dead service, your API stops responding to everything else. The breaker prevents this exhaustion.
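The exhaustion can be estimated with Little's Law (in-flight requests = arrival rate × latency). The sketch below is only back-of-the-envelope arithmetic; the function names and the 50 requests/sec figure used afterwards are assumptions for illustration.

```go
package main

import "time"

// InFlight estimates how many requests are outstanding at steady state,
// using Little's Law: arrival rate multiplied by per-request latency.
func InFlight(reqPerSec float64, latency time.Duration) float64 {
	return reqPerSec * latency.Seconds()
}

// TimeToExhaust estimates how long a pool of workers lasts when every
// new request hangs until its timeout and no worker is released.
func TimeToExhaust(workers int, reqPerSec float64) time.Duration {
	return time.Duration(float64(workers) / reqPerSec * float64(time.Second))
}
```

Assuming 50 requests/sec to the dependency, a 200ms call keeps about 10 requests in flight, while a 12s call keeps about 600. With a 30s timeout against a dead service, a pool of 500 threads is fully occupied after roughly 10 seconds.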

The Three States

Closed: Requests flow normally. Failures are tracked.
Open: Requests fail immediately. No network calls are made.
Half-Open: After a 'sleep window,' a few trial requests are sent. If they succeed, the circuit closes. If they fail, it re-opens.
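The transitions between these states can be sketched as a small state machine. This is a simplified illustration, not how any particular library does it: it trips on consecutive failures rather than a failure rate over a sliding window, it does not cap the number of trial requests in Half-Open, it is not safe for concurrent use, and all constants and names are hypothetical. Time is passed in as elapsed time (for example `time.Since(start)`) to keep the logic deterministic.

```go
package main

import "time"

// State is one of the three circuit breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Hypothetical tuning values for this sketch.
const (
	FailureThreshold = 3                // consecutive failures before tripping
	SleepWindow      = 30 * time.Second // how long the circuit stays Open
	TrialSuccesses   = 2                // successes needed in Half-Open to close
)

// Breaker is a minimal, non-concurrent circuit breaker.
type Breaker struct {
	state     State
	failures  int           // consecutive failures while Closed
	successes int           // successful trial requests while Half-Open
	openedAt  time.Duration // elapsed time at which the breaker last tripped
}

// Allow reports whether a request may proceed at elapsed time now.
func (b *Breaker) Allow(now time.Duration) bool {
	if b.state == Open && now-b.openedAt >= SleepWindow {
		b.state = HalfOpen // sleep window elapsed: let trial requests through
		b.successes = 0
	}
	return b.state != Open
}

// Record feeds the outcome of an allowed request back into the breaker.
func (b *Breaker) Record(success bool, now time.Duration) {
	switch b.state {
	case Closed:
		if success {
			b.failures = 0
			return
		}
		b.failures++
		if b.failures >= FailureThreshold {
			b.trip(now)
		}
	case HalfOpen:
		if !success {
			b.trip(now) // any trial failure re-opens the circuit
			return
		}
		b.successes++
		if b.successes >= TrialSuccesses {
			b.state = Closed
			b.failures = 0
		}
	}
}

// trip moves the breaker to Open and restarts the sleep window.
func (b *Breaker) trip(now time.Duration) {
	b.state = Open
	b.openedAt = now
}

// State returns the current state.
func (b *Breaker) State() State { return b.state }
```

A caller checks `Allow` before each request, fails fast when it returns false, and reports the outcome with `Record` otherwise.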

In my projects, I use gobreaker or the resilience features built into modern service meshes like Istio or Linkerd. However, application-level breakers are often better because they can be context-aware (e.g., breaking only for specific high-cost endpoints).

package resilience

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

var (
	cb *gobreaker.CircuitBreaker

	// Illustrative client timeout; tune it for your dependency.
	httpClient = &http.Client{Timeout: 5 * time.Second}
)

func init() {
	settings := gobreaker.Settings{
		Name:        "Payment-Gateway",
		MaxRequests: 5,                // Allow 5 requests in Half-Open state
		Interval:    10 * time.Second, // Clear counts every 10s when Closed
		Timeout:     30 * time.Second, // Stay Open for 30s before trying Half-Open
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			// Trip if we have at least 10 requests and failure rate is > 60%
			if counts.Requests < 10 {
				return false
			}
			failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
			return failureRatio > 0.6
		},
		OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
			fmt.Printf("Circuit Breaker [%s] changed from %s to %s\n", name, from, to)
			// In production, push this to Prometheus/Grafana immediately
		},
	}
	cb = gobreaker.NewCircuitBreaker(settings)
}

func CallExternalService() ([]byte, error) {
	result, err := cb.Execute(func() (interface{}, error) {
		// Your actual HTTP call here
		resp, err := httpClient.Get("https://api.gateway.com/v1/pay")
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()

		// Only 5xx responses count as failures against the breaker
		if resp.StatusCode >= 500 {
			return nil, errors.New("downstream error")
		}

		// Read the body inside the breaker so failed reads are counted too
		data, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, err
		}
		return data, nil
	})

	if err != nil {
		return nil, err
	}
	// Safe assertion: the closure above only ever returns []byte on success
	return result.([]byte), nil
}

What the Docs Don't Tell You: The Gotchas

I have learned the hard way that you cannot just 'set and forget' these patterns. Here are the three most common mistakes I see senior engineers make:

1. The Double-Retry

If Service A calls Service B, and Service B calls Service C, and both A and B make up to three attempts per call, a single failing request to Service C can result in 9 requests hitting it (3 × 3). With a deeper chain, the traffic grows exponentially with the depth of the chain. The Fix: Only retry at the outermost layer or use 'Retry Budgets' (e.g., only allow 10% of total traffic to be retries).
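One way to picture a retry budget is as a token account: every first-attempt request deposits a small credit, and every retry spends a full one. The sketch below is an illustration under that assumption; the type `RetryBudget`, its methods, and the burst cap are hypothetical names, not the API of any library or mesh. It uses integer units to avoid floating-point drift.

```go
package main

import "sync"

// retryCost is the price of one retry, in budget units.
// Each first-attempt request deposits `percent` units, so a budget of
// 10 percent earns one retry for every ten requests.
const retryCost = 100

// RetryBudget caps retries at a percentage of observed requests.
type RetryBudget struct {
	mu         sync.Mutex
	percent    int // retries allowed per 100 requests
	balance    int // accumulated budget units
	maxBalance int // cap, so an idle period cannot bank unlimited retries
}

// NewRetryBudget allows `percent` retries per 100 requests, with at most
// maxBurst retries saved up at any time.
func NewRetryBudget(percent, maxBurst int) *RetryBudget {
	return &RetryBudget{percent: percent, maxBalance: maxBurst * retryCost}
}

// RecordRequest credits the budget for one first-attempt request.
func (b *RetryBudget) RecordRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.balance += b.percent
	if b.balance > b.maxBalance {
		b.balance = b.maxBalance
	}
}

// TryRetry spends one retry from the budget. A false result means the
// budget is exhausted and the caller should fail instead of retrying.
func (b *RetryBudget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.balance < retryCost {
		return false
	}
	b.balance -= retryCost
	return true
}
```

With `NewRetryBudget(10, 5)`, ten requests earn exactly one retry, and no more than five retries can be saved up. During an outage the budget drains quickly, so retry traffic stays near 10% of normal load instead of multiplying it.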

2. Ignoring Idempotency

Retrying a POST /payments request is a disaster if you haven't implemented idempotency keys. You will double-charge the customer. The Fix: Every retry-able write operation must carry a unique X-Idempotency-Key (usually a UUID) that the server uses to ensure it only processes the request once.
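On the server side, the key lets the handler replay a stored result instead of repeating the side effect. The following is a minimal in-memory sketch with hypothetical names (`IdempotencyStore`, `Do`); a real service would need a shared, persistent store with expiry, and would not hold one lock across every operation as this does.

```go
package main

import "sync"

// IdempotencyStore remembers the result produced for each idempotency key.
// In-memory only and never expires entries: a sketch, not production code.
type IdempotencyStore struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> stored response
}

// NewIdempotencyStore returns an empty store.
func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{results: make(map[string]string)}
}

// Do runs fn at most once per key. A repeated key returns the stored
// result with replayed set to true, and fn is not called again.
// The lock is held while fn runs, which serializes all operations.
func (s *IdempotencyStore) Do(key string, fn func() string) (result string, replayed bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if stored, ok := s.results[key]; ok {
		return stored, true
	}
	result = fn()
	s.results[key] = result
	return result, false
}
```

A payment handler would read the key from the request header and wrap the charge in `Do`, so a retried POST returns the original outcome rather than charging twice.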

3. Static Thresholds in Dynamic Environments

In 2026, workloads are elastic. A 50% failure threshold might be fine at 100 requests/sec, but at 10,000 requests/sec it reacts too late: thousands of requests have already failed by the time the ratio crosses the line. The Fix: Use adaptive concurrency limits. Instead of a hard failure count, monitor the trend of the P99 latency. If latency starts climbing, trip the breaker before the errors start happening.
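A latency-based trip signal could look roughly like the sketch below: keep a rolling window of recent latencies and compare its P99 against a known-good baseline. The type `LatencyMonitor`, the fixed baseline, and the multiplier are assumptions for illustration; real adaptive limiters are considerably more elaborate, and this version is not safe for concurrent use.

```go
package main

import (
	"sort"
	"time"
)

// LatencyMonitor tracks recent latencies and flags a climbing P99.
type LatencyMonitor struct {
	samples  []time.Duration // ring buffer of the most recent observations
	next     int             // index of the oldest sample once the buffer is full
	size     int             // window size in samples
	baseline time.Duration   // P99 measured under healthy conditions (assumed known)
	factor   float64         // trip when P99 exceeds baseline * factor
}

// NewLatencyMonitor creates a monitor over the last `size` samples.
func NewLatencyMonitor(size int, baseline time.Duration, factor float64) *LatencyMonitor {
	return &LatencyMonitor{size: size, baseline: baseline, factor: factor}
}

// Observe records one request latency, overwriting the oldest when full.
func (m *LatencyMonitor) Observe(d time.Duration) {
	if len(m.samples) < m.size {
		m.samples = append(m.samples, d)
		return
	}
	m.samples[m.next] = d
	m.next = (m.next + 1) % m.size
}

// P99 returns the 99th percentile of the window (nearest-rank method),
// or zero when no samples have been observed.
func (m *LatencyMonitor) P99() time.Duration {
	n := len(m.samples)
	if n == 0 {
		return 0
	}
	sorted := make([]time.Duration, n)
	copy(sorted, m.samples)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (99*n + 99) / 100 // integer ceiling of 0.99 * n
	return sorted[rank-1]
}

// ShouldTrip reports whether the window's P99 has climbed past the limit.
func (m *LatencyMonitor) ShouldTrip() bool {
	return float64(m.P99()) > m.factor*float64(m.baseline)
}
```

Because the signal is a percentile over a window, a single slow request does not trip it, but a sustained shift in the tail does, and it does so before timeouts turn into errors.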

Observability: The Missing Link

A circuit breaker that trips in silence is a bug, not a feature. You must hook into the OnStateChange events to fire alerts. In my current stack, we use OpenTelemetry (OTel) to decorate our traces. When a breaker opens, we inject a specific attribute into the span: resilience.circuit_breaker.state = open.

This allows us to look at a trace and immediately see: "Oh, the reason this request failed isn't because the database is slow; it's because our circuit breaker prevented the call to the database to save the system."

Takeaway

Reliability is not about preventing errors; it is about containing them. Today, go to your most critical downstream dependency (whether it is a database, a legacy API, or an LLM) and check if you have a circuit breaker. If you don't, wrap that client call in a breaker with a 50% failure threshold and a 30-second open timeout before it probes again in Half-Open. It is the single most effective thing you can do to prevent your next 3 AM outage.
