
Step 4: Failover, Outlier Detection & Circuit Breaking

Duration: ~30 minutes
Goal: Configure primary/backup failover, automatic endpoint ejection, circuit breaking, and retry policies.


What you will learn

  • How priority groups drive primary → backup failover
  • The difference between active health checks and outlier detection
  • How circuit breakers protect your upstreams
  • How retry policies work for transient gRPC errors
  • How to observe all of this through the admin API

Concepts

Priority groups

Every lb_endpoint belongs to an endpoint group with a priority number. Envoy sends traffic to the lowest priority group that has at least one healthy endpoint.

priority 0 (primaries): backend1, backend2  ← traffic goes here normally
priority 1 (backup):    backup              ← only used if ALL of priority 0 is unhealthy

Traffic shifts from priority 0 to priority 1 only when too little priority-0 capacity is healthy. The threshold is controlled by overprovisioning_factor (default 140, i.e. a 1.4× multiplier): traffic starts to shift when the healthy fraction drops below 1/1.4 ≈ 71%. For a simple 2-node primary cluster, losing 1 of 2 (50% healthy) triggers partial spillover; losing both triggers full failover.

Partial spillover

Envoy can split traffic across priorities. If you have 3 primaries and 1 goes down, some traffic may spill to the backup depending on the overprovisioning factor. For a strict "all or nothing" failover, raise overprovisioning_factor high enough that losing some primaries still yields 100% effective capacity: for a 2-node primary cluster, a factor of 200 means 1 of 2 healthy still covers full load, so the backup activates only when both primaries are down.
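The priority-load arithmetic above can be sketched in a few lines of Go. The `spillover` helper is purely illustrative (not an Envoy API); it applies the healthy-percentage × overprovisioning-factor formula, capped at 100%:

```go
package main

import "fmt"

// spillover returns the percentage of traffic kept on priority 0 and
// the percentage that spills to priority 1, following the priority-load
// formula: effective capacity = healthy% × overprovisioning factor,
// capped at 100%.
func spillover(healthyPct, factor float64) (p0, p1 float64) {
	p0 = healthyPct * factor
	if p0 > 100 {
		p0 = 100
	}
	return p0, 100 - p0
}

func main() {
	for _, h := range []float64{100, 50, 0} {
		p0, p1 := spillover(h, 1.4) // default overprovisioning_factor: 140
		fmt.Printf("healthy=%3.0f%%  priority0=%3.0f%%  priority1=%3.0f%%\n", h, p0, p1)
	}
}
```

With the default factor, 50% healthy primaries keep 70% of the traffic and spill 30% to the backup, matching the partial-spillover behavior described above.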

How priority failover is configured in YAML

The priority number is set directly on the endpoints list entry, not on individual endpoints:

load_assignment:
  cluster_name: grpc_cluster
  endpoints:
    - priority: 0           # ← set priority on the locality group
      lb_endpoints:
        - endpoint: ...     # primary 1
        - endpoint: ...     # primary 2
    - priority: 1           # ← separate locality group for backup
      lb_endpoints:
        - endpoint: ...     # backup

Each endpoints entry is a LocalityLbEndpoints message. All endpoints within one entry share the same priority and locality metadata.

Outlier detection

Outlier detection is passive: Envoy watches real traffic and ejects endpoints that show error patterns. No probes are sent.

Key parameters:

Parameter                         Meaning
consecutive_gateway_errors        Consecutive 5xx or connection failures before ejection
consecutive_local_origin_failure  Consecutive local-origin failures (connection refused, reset)
base_ejection_time                How long the first ejection lasts
max_ejection_percent              Max % of the cluster that can be ejected simultaneously
interval                          How often to evaluate ejection candidates

Ejection timeline:

  1. Envoy observes consecutive_gateway_errors (e.g. 5) consecutive failures from an endpoint.
  2. The endpoint is ejected for base_ejection_time × (number of times this endpoint has been ejected). The first ejection lasts exactly base_ejection_time, the second 2 × base_ejection_time, and so on. This growing back-off prevents flapping.
  3. After the ejection period, the endpoint is re-admitted to the load-balancing pool.
  4. If it fails again immediately, the next ejection period grows by another base_ejection_time.

max_ejection_percent is a safety valve: Envoy won't eject more than this percentage of the cluster, even if many endpoints are all failing. This prevents Envoy from ejecting so many endpoints that the remaining ones get overloaded.

Outlier detection vs active health checks

Both mechanisms mark endpoints unhealthy, but differently:

  • Active health checks mark an endpoint unhealthy only after the health check probe fails unhealthy_threshold times. The endpoint is considered FAILED_ACTIVE_HC.
  • Outlier detection ejects an endpoint based on real request errors. The endpoint is still "healthy" by active health check standards, just temporarily ejected from the LB pool.

An endpoint is excluded from traffic if it is unhealthy by either mechanism.
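The either-mechanism rule can be written as a small eligibility check. This is a toy model; the type and field names are illustrative, not Envoy internals:

```go
package main

import "fmt"

// endpointHealth tracks the two independent ways an endpoint can be
// taken out of rotation.
type endpointHealth struct {
	failedActiveHC bool // probe failed unhealthy_threshold times (FAILED_ACTIVE_HC)
	ejected        bool // outlier-detection ejection currently in effect
}

// eligible reports whether the endpoint may receive traffic: it must be
// clear of BOTH mechanisms.
func (e endpointHealth) eligible() bool {
	return !e.failedActiveHC && !e.ejected
}

func main() {
	fmt.Println(endpointHealth{}.eligible())                     // healthy on both counts
	fmt.Println(endpointHealth{ejected: true}.eligible())        // passing probes, but ejected
	fmt.Println(endpointHealth{failedActiveHC: true}.eligible()) // failing probes
}
```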

Circuit breaking

Circuit breakers limit how much load Envoy will send to a cluster. Unlike ejecting individual endpoints, circuit breaking applies to the entire cluster:

Setting               What it limits
max_connections       Open TCP connections to the cluster
max_pending_requests  Requests queued waiting for a connection
max_requests          Active concurrent requests
max_retries           Active concurrent retries

When a limit is hit, excess requests are rejected immediately with 503 Service Unavailable (which gRPC clients observe as status UNAVAILABLE). The stat cluster.<name>.upstream_rq_pending_overflow counts these rejections.
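A toy model of just the max_requests limit illustrates the mechanism (names are illustrative; Envoy tracks these counters per cluster with more moving parts than shown here):

```go
package main

import "fmt"

// breaker models a cluster-wide circuit breaker: a request is admitted
// while active requests are below max_requests; otherwise it overflows
// and the caller gets an immediate 503.
type breaker struct {
	maxRequests int
	active      int
	overflows   int // analogous to the *_overflow admin stats
}

// tryAcquire admits a request or records an overflow.
func (b *breaker) tryAcquire() bool {
	if b.active >= b.maxRequests {
		b.overflows++
		return false
	}
	b.active++
	return true
}

// release marks a request as finished, freeing a slot.
func (b *breaker) release() { b.active-- }

func main() {
	b := &breaker{maxRequests: 2}
	for i := 0; i < 4; i++ {
		fmt.Println("admitted:", b.tryAcquire())
	}
	fmt.Println("overflows:", b.overflows)
	b.release() // once a request completes, a new one fits again
	fmt.Println("after release, admitted:", b.tryAcquire())
}
```

The key property is that rejection is instantaneous: no queueing, no waiting, which is what stops an overloaded cluster from accumulating unbounded work.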

Retry policy

Retries are configured on the route, not the cluster. For gRPC, retry conditions:

Condition           Triggers on
connect-failure     TCP connection failure to upstream
reset               Upstream reset the stream (GOAWAY, RST_STREAM)
resource-exhausted  gRPC RESOURCE_EXHAUSTED status
unavailable         gRPC UNAVAILABLE status

Retries and non-idempotent calls

Only retry when it is safe to re-send the request. Retrying a non-idempotent write can cause duplicates. Use the retriable-headers or retriable-status-codes retry conditions to limit retries to cases you control.
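The retry-safety rule combines two independent questions: is the failure transient, and is the call safe to re-send? A sketch of that decision (the helper and status strings are illustrative, not the Envoy retry implementation):

```go
package main

import "fmt"

// shouldRetry sketches the decision a retry policy should encode:
// retry only statuses that indicate a transient condition, and only
// when the call is idempotent (safe to re-send without duplicates).
func shouldRetry(grpcStatus string, idempotent bool) bool {
	transient := map[string]bool{
		"UNAVAILABLE":        true,
		"RESOURCE_EXHAUSTED": true,
	}
	return idempotent && transient[grpcStatus]
}

func main() {
	fmt.Println(shouldRetry("UNAVAILABLE", true))       // idempotent read: retry
	fmt.Println(shouldRetry("UNAVAILABLE", false))      // non-idempotent write: don't
	fmt.Println(shouldRetry("INVALID_ARGUMENT", true))  // permanent error: don't
}
```

In Envoy the "is it transient" half lives in retry_on, while the "is it safe" half is yours to express per route, for example by routing writes through a route with no retry_policy at all.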


Setup

mkdir -p envoy-tutorial/step4 && cd envoy-tutorial/step4
go mod init step4   # any module name works; the Dockerfile expects a go.mod next to server.go

Backend: server.go

This backend supports a FAIL_ON_START env var and an in-process mode where it can be made to return errors — useful for triggering outlier detection without stopping containers.

package main

import (
    "context"
    "log"
    "net"
    "net/http"
    "os"
    "sync/atomic"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/health"
    "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/reflection"
    "google.golang.org/grpc/status"

    pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

var failing atomic.Bool

type server struct{ pb.UnimplementedGreeterServer }

func (s *server) SayHello(_ context.Context, req *pb.HelloRequest) (*pb.HelloReply, error) {
    if failing.Load() {
        return nil, status.Error(codes.Unavailable, "server is in failing mode")
    }
    hostname, _ := os.Hostname()
    return &pb.HelloReply{
        Message: "Hello " + req.Name + " from " + hostname,
    }, nil
}

func main() {
    if os.Getenv("FAIL_ON_START") == "true" {
        failing.Store(true)
        log.Println("starting in FAILING mode")
    }

    healthSrv := health.NewServer()

    // Sync health status with failing flag
    go func() {
        for range time.Tick(500 * time.Millisecond) {
            if failing.Load() {
                healthSrv.SetServingStatus("", grpc_health_v1.HealthCheckResponse_NOT_SERVING)
            } else {
                healthSrv.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)
            }
        }
    }()

    // HTTP control endpoint: POST /fail and POST /recover
    go func() {
        mux := http.NewServeMux()
        mux.HandleFunc("/fail", func(w http.ResponseWriter, _ *http.Request) {
            failing.Store(true)
            log.Println("switched to FAILING mode via HTTP")
            w.Write([]byte("now failing\n"))
        })
        mux.HandleFunc("/recover", func(w http.ResponseWriter, _ *http.Request) {
            failing.Store(false)
            log.Println("switched to HEALTHY mode via HTTP")
            w.Write([]byte("now healthy\n"))
        })
        log.Println("control server on :8080")
        http.ListenAndServe(":8080", mux)
    }()

    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }
    s := grpc.NewServer()
    pb.RegisterGreeterServer(s, &server{})
    grpc_health_v1.RegisterHealthServer(s, healthSrv)
    reflection.Register(s)

    hostname, _ := os.Hostname()
    log.Printf("gRPC server on :50051 (hostname=%s)", hostname)
    s.Serve(lis)
}

Dockerfile

FROM golang:1.22-alpine AS build
WORKDIR /app
COPY go.mod server.go ./
RUN go mod tidy && go build -o server .

FROM alpine:latest
COPY --from=build /app/server /server
ENTRYPOINT ["/server"]

docker-compose.yaml

services:
  primary1:
    build: .

  primary2:
    build: .

  backup:
    build: .

  envoy:
    image: envoyproxy/envoy:v1.31-latest
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml
    ports:
      - "10000:10000"
      - "9901:9901"
    command: envoy -c /etc/envoy/envoy.yaml --log-level info
    depends_on:
      - primary1
      - primary2
      - backup

Envoy config: envoy.yaml

static_resources:
  listeners:
    - name: grpc_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: grpc_proxy
                codec_type: AUTO
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                route_config:
                  name: grpc_route
                  virtual_hosts:
                    - name: grpc_services
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/", grpc: {} }
                          route:
                            cluster: grpc_cluster
                            timeout: 0s
                            retry_policy:  # (1)
                              retry_on: "connect-failure,reset,resource-exhausted,unavailable"
                              num_retries: 3
                              per_try_timeout: 5s
                http_filters:
                  - name: envoy.filters.http.grpc_stats
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_stats.v3.FilterConfig
                      emit_filter_state: true
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: grpc_cluster
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}

      # --- CIRCUIT BREAKER ---  (2)
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 100
            max_pending_requests: 50
            max_requests: 200
            max_retries: 3

      # --- OUTLIER DETECTION ---  (3)
      outlier_detection:
        consecutive_gateway_errors: 5
        consecutive_local_origin_failure: 5
        interval: 10s
        base_ejection_time: 30s
        max_ejection_percent: 50

      # --- ACTIVE HEALTH CHECKING ---
      health_checks:
        - timeout: 2s
          interval: 5s
          unhealthy_threshold: 2
          healthy_threshold: 1
          grpc_health_check: {}

      # --- PRIORITY GROUPS ---  (4)
      load_assignment:
        cluster_name: grpc_cluster
        endpoints:
          - priority: 0  # primaries — receives all traffic when healthy
            lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: primary1, port_value: 50051 }
              - endpoint:
                  address:
                    socket_address: { address: primary2, port_value: 50051 }
          - priority: 1  # backup — only used when ALL priority 0 endpoints are unhealthy
            lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: backup, port_value: 50051 }

admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
  1. Retry policy on the route: retry on connect-failure, stream reset, and gRPC UNAVAILABLE/RESOURCE_EXHAUSTED status codes. per_try_timeout limits each individual attempt, while timeout on the route (set to 0s) disables the total deadline. When timeout: 0s and per_try_timeout: 5s, each retry attempt has up to 5 seconds, but the total call can run indefinitely across all retries.
  2. Circuit breaker limits prevent cascading failure during an overload event. The max_retries: 3 limit is particularly important: it prevents retry storms where retries generate more retries. priority: DEFAULT sets limits for normal traffic; you can also set separate limits for HIGH priority traffic (routes configured with priority: HIGH).
  3. Outlier detection watches real traffic. After 5 consecutive errors from an endpoint, it is ejected for base_ejection_time (30s). The second ejection lasts 60s, the third 90s, etc. At most 50% of the cluster can be ejected at once — so with 2 primaries, at most 1 is ejected at a time.
  4. Priority 0 endpoints receive all traffic normally. Priority 1 only activates when Envoy determines that not enough priority-0 capacity is healthy (below ~71% with the default overprovisioning_factor: 140).

Run it

docker compose up --build

Exercises

1. Verify only primaries serve traffic

for i in $(seq 1 8); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
  | grep message
done

All responses should show primary1 or primary2 container IDs — never backup.

2. Kill both primaries and watch failover

docker compose stop primary1 primary2

Wait 10–15 seconds for health checks to detect failure (unhealthy_threshold: 2 × interval: 5s = 10s). Then send requests:

for i in $(seq 1 6); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
  | grep message
done

Now you should see the backup container ID in all responses.

3. Watch priority state in the admin API

curl -s http://localhost:9901/clusters | grep -E "(priority|health_flags|cx_connect)"

Look for health_flags: endpoints failing active health checks show health_flags::/failed_active_hc.

For a structured view, use the JSON endpoint:

curl -s http://localhost:9901/clusters?format=json | python3 -m json.tool | grep -A5 priority

4. Restore primaries and verify traffic returns

docker compose start primary1 primary2

Wait 10s, then send requests again. Traffic should shift back to the primaries automatically once they pass healthy_threshold: 1 health check.

5. Trigger outlier detection without stopping containers

The backend exposes a control HTTP endpoint. Find the container names:

docker compose ps

Then switch primary1 to failing mode (adjust service name if needed):

# Send the /fail command to primary1's control port
docker compose exec primary1 wget -qO- http://localhost:8080/fail

Now send gRPC requests — primary1 will start returning UNAVAILABLE. After 5 consecutive errors, outlier detection ejects it:

for i in $(seq 1 20); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
  | grep message
done

Check outlier detection stats:

curl -s http://localhost:9901/stats | grep "outlier_detection\|ejected"

Recover the backend:

docker compose exec primary1 wget -qO- http://localhost:8080/recover

6. Observe circuit breaker

Check the overflow counters (they start at zero):

curl -s http://localhost:9901/stats | grep overflow

To trigger circuit breaking intentionally, you can reduce max_requests to a very low value (e.g. 1) and send concurrent requests with a load testing tool like ghz.


Key admin API stats for this step

Stat                                                      Meaning
cluster.grpc_cluster.upstream_rq_retry                    Total retries performed
cluster.grpc_cluster.upstream_rq_retry_overflow           Retries rejected by circuit breaker
cluster.grpc_cluster.upstream_rq_pending_overflow         Requests rejected (pending limit hit)
cluster.grpc_cluster.outlier_detection.ejections_active   Currently ejected endpoints
cluster.grpc_cluster.outlier_detection.ejections_total    All-time ejection count

What you learned

  • Priority groups: priority: 0 for primaries, priority: 1 for backup
  • Active health checks + outlier detection work together
  • Circuit breaker thresholds per priority level
  • Retry policy for transient gRPC errors
  • How to trigger and observe failover through the admin API

Next step

In Step 5 you will move everything to Kubernetes using the Gateway API and dynamic endpoint discovery.