
Step 4: Failover, Outlier Detection & Circuit Breaking

Duration: ~30 minutes
Goal: Configure primary/backup failover, automatic endpoint ejection, circuit breaking, and retry policies.


What you will learn

  • How priority groups drive primary → backup failover
  • The difference between active health checks and outlier detection
  • How circuit breakers protect your upstreams
  • How retry policies work for transient gRPC errors
  • How to observe all of this through the admin API

Concepts

Priority groups

Every lb_endpoint belongs to an endpoint group with a priority number. Envoy sends traffic to the lowest priority group that has at least one healthy endpoint.

priority 0 (primaries): backend1, backend2  ← traffic goes here normally
priority 1 (backup):    backup              ← only used if ALL of priority 0 is unhealthy

Traffic shifts from priority 0 to priority 1 only when too little priority-0 capacity is healthy. The threshold is controlled by overprovisioning_factor (default 140, i.e. a 1.4× multiplier): traffic starts to shift when the healthy fraction drops below 1/1.4 ≈ 71%. For a simple 2-node primary cluster, losing 1 of 2 (50% healthy) triggers partial spillover; losing both triggers full failover.

Partial spillover

Envoy can split traffic across priorities. If you have 3 primaries and 1 goes down, some traffic may spill to the backup depending on the overprovisioning factor. For a strict "all or nothing" failover, raise overprovisioning_factor high enough that losing some primaries still yields 100% effective capacity: for a 2-node primary cluster, a factor of 200 means 1 of 2 healthy still covers full load, so the backup activates only when both primaries are down.
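The priority-load arithmetic above can be sketched in a few lines of Go. The `spillover` helper is purely illustrative (not an Envoy API); it applies the healthy-percentage × overprovisioning-factor formula, capped at 100%:

```go
package main

import "fmt"

// spillover returns the percentage of traffic kept on priority 0 and
// the percentage that spills to priority 1, following the priority-load
// formula: effective capacity = healthy% × overprovisioning factor,
// capped at 100%.
func spillover(healthyPct, factor float64) (p0, p1 float64) {
	p0 = healthyPct * factor
	if p0 > 100 {
		p0 = 100
	}
	return p0, 100 - p0
}

func main() {
	for _, h := range []float64{100, 50, 0} {
		p0, p1 := spillover(h, 1.4) // default overprovisioning_factor: 140
		fmt.Printf("healthy=%3.0f%%  priority0=%3.0f%%  priority1=%3.0f%%\n", h, p0, p1)
	}
}
```

With the default factor, 50% healthy primaries keep 70% of the traffic and spill 30% to the backup, matching the partial-spillover behavior described above.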

How priority failover is configured in YAML

The priority number is set directly on the endpoints list entry, not on individual endpoints:

load_assignment:
  cluster_name: grpc_cluster
  endpoints:
    - priority: 0           # ← set priority on the locality group
      lb_endpoints:
        - endpoint: ...     # primary 1
        - endpoint: ...     # primary 2
    - priority: 1           # ← separate locality group for backup
      lb_endpoints:
        - endpoint: ...     # backup

Each endpoints entry is a LocalityLbEndpoints message. All endpoints within one entry share the same priority and locality metadata.

Outlier detection

Outlier detection is passive: Envoy watches real traffic and ejects endpoints that show error patterns. No probes are sent.

Key parameters:

Parameter                         Meaning
consecutive_gateway_errors        Consecutive 5xx or connection failures before ejection
consecutive_local_origin_failure  Consecutive local-origin failures (connection refused, reset)
base_ejection_time                How long the first ejection lasts
max_ejection_percent              Max % of the cluster that can be ejected simultaneously
interval                          How often to evaluate ejection candidates

Ejection timeline:

  1. Envoy observes consecutive_gateway_errors (e.g. 5) consecutive failures from an endpoint.
  2. The endpoint is ejected for base_ejection_time × (number of times this endpoint has been ejected). The first ejection lasts exactly base_ejection_time, the second 2 × base_ejection_time, and so on. This growing back-off prevents flapping.
  3. After the ejection period, the endpoint is re-admitted to the load-balancing pool.
  4. If it fails again immediately, the next ejection period grows by another base_ejection_time.

max_ejection_percent is a safety valve: Envoy won't eject more than this percentage of the cluster, even if many endpoints are all failing. This prevents Envoy from ejecting so many endpoints that the remaining ones get overloaded.

Outlier detection vs active health checks

Both mechanisms mark endpoints unhealthy, but differently:

  • Active health checks mark an endpoint unhealthy only after the health check probe fails unhealthy_threshold times. The endpoint is considered FAILED_ACTIVE_HC.
  • Outlier detection ejects an endpoint based on real request errors. The endpoint is still "healthy" by active health check standards, just temporarily ejected from the LB pool.

An endpoint is excluded from traffic if it is unhealthy by either mechanism.
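The either-mechanism rule can be written as a small eligibility check. This is a toy model; the type and field names are illustrative, not Envoy internals:

```go
package main

import "fmt"

// endpointHealth tracks the two independent ways an endpoint can be
// taken out of rotation.
type endpointHealth struct {
	failedActiveHC bool // probe failed unhealthy_threshold times (FAILED_ACTIVE_HC)
	ejected        bool // outlier-detection ejection currently in effect
}

// eligible reports whether the endpoint may receive traffic: it must be
// clear of BOTH mechanisms.
func (e endpointHealth) eligible() bool {
	return !e.failedActiveHC && !e.ejected
}

func main() {
	fmt.Println(endpointHealth{}.eligible())                     // healthy on both counts
	fmt.Println(endpointHealth{ejected: true}.eligible())        // passing probes, but ejected
	fmt.Println(endpointHealth{failedActiveHC: true}.eligible()) // failing probes
}
```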

Circuit breaking

Circuit breakers limit how much load Envoy will send to a cluster. Unlike ejecting individual endpoints, circuit breaking applies to the entire cluster:

Setting               What it limits
max_connections       Open TCP connections to the cluster
max_pending_requests  Requests queued waiting for a connection
max_requests          Active concurrent requests
max_retries           Active concurrent retries

When a limit is hit, excess requests are rejected immediately with 503 Service Unavailable (which gRPC clients observe as status UNAVAILABLE). The stat cluster.<name>.upstream_rq_pending_overflow counts these rejections.
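A toy model of just the max_requests limit illustrates the mechanism (names are illustrative; Envoy tracks these counters per cluster with more moving parts than shown here):

```go
package main

import "fmt"

// breaker models a cluster-wide circuit breaker: a request is admitted
// while active requests are below max_requests; otherwise it overflows
// and the caller gets an immediate 503.
type breaker struct {
	maxRequests int
	active      int
	overflows   int // analogous to the *_overflow admin stats
}

// tryAcquire admits a request or records an overflow.
func (b *breaker) tryAcquire() bool {
	if b.active >= b.maxRequests {
		b.overflows++
		return false
	}
	b.active++
	return true
}

// release marks a request as finished, freeing a slot.
func (b *breaker) release() { b.active-- }

func main() {
	b := &breaker{maxRequests: 2}
	for i := 0; i < 4; i++ {
		fmt.Println("admitted:", b.tryAcquire())
	}
	fmt.Println("overflows:", b.overflows)
	b.release() // once a request completes, a new one fits again
	fmt.Println("after release, admitted:", b.tryAcquire())
}
```

The key property is that rejection is instantaneous: no queueing, no waiting, which is what stops an overloaded cluster from accumulating unbounded work.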

Retry policy

Retries are configured on the route, not the cluster. For gRPC, retry conditions:

Condition           Triggers on
connect-failure     TCP connection failure to upstream
reset               Upstream reset the stream (GOAWAY, RST_STREAM)
resource-exhausted  gRPC RESOURCE_EXHAUSTED status
unavailable         gRPC UNAVAILABLE status

Retries and non-idempotent calls

Only retry when it is safe to re-send the request. Retrying a non-idempotent write can cause duplicates. Use the retriable-headers or retriable-status-codes retry conditions to limit retries to cases you control.
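The retry-safety rule combines two independent questions: is the failure transient, and is the call safe to re-send? A sketch of that decision (the helper and status strings are illustrative, not the Envoy retry implementation):

```go
package main

import "fmt"

// shouldRetry sketches the decision a retry policy should encode:
// retry only statuses that indicate a transient condition, and only
// when the call is idempotent (safe to re-send without duplicates).
func shouldRetry(grpcStatus string, idempotent bool) bool {
	transient := map[string]bool{
		"UNAVAILABLE":        true,
		"RESOURCE_EXHAUSTED": true,
	}
	return idempotent && transient[grpcStatus]
}

func main() {
	fmt.Println(shouldRetry("UNAVAILABLE", true))       // idempotent read: retry
	fmt.Println(shouldRetry("UNAVAILABLE", false))      // non-idempotent write: don't
	fmt.Println(shouldRetry("INVALID_ARGUMENT", true))  // permanent error: don't
}
```

In Envoy the "is it transient" half lives in retry_on, while the "is it safe" half is yours to express per route, for example by routing writes through a route with no retry_policy at all.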


Setup

mkdir -p envoy-tutorial/step4 && cd envoy-tutorial/step4
go mod init step4   # any module name works; the Dockerfile expects a go.mod next to server.go

Backend: server.go

This backend supports a FAIL_ON_START env var and an in-process mode where it can be made to return errors — useful for triggering outlier detection without stopping containers.

package main

import (
    "context"
    "log"
    "net"
    "net/http"
    "os"
    "sync/atomic"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/health"
    "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/reflection"
    "google.golang.org/grpc/status"

    pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

var failing atomic.Bool

type server struct{ pb.UnimplementedGreeterServer }

func (s *server) SayHello(_ context.Context, req *pb.HelloRequest) (*pb.HelloReply, error) {
    if failing.Load() {
        return nil, status.Error(codes.Unavailable, "server is in failing mode")
    }
    hostname, _ := os.Hostname()
    return &pb.HelloReply{
        Message: "Hello " + req.Name + " from " + hostname,
    }, nil
}

func main() {
    if os.Getenv("FAIL_ON_START") == "true" {
        failing.Store(true)
        log.Println("starting in FAILING mode")
    }

    healthSrv := health.NewServer()

    // Sync health status with failing flag
    go func() {
        for range time.Tick(500 * time.Millisecond) {
            if failing.Load() {
                healthSrv.SetServingStatus("", grpc_health_v1.HealthCheckResponse_NOT_SERVING)
            } else {
                healthSrv.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)
            }
        }
    }()

    // HTTP control endpoint: POST /fail and POST /recover
    go func() {
        mux := http.NewServeMux()
        mux.HandleFunc("/fail", func(w http.ResponseWriter, _ *http.Request) {
            failing.Store(true)
            log.Println("switched to FAILING mode via HTTP")
            w.Write([]byte("now failing\n"))
        })
        mux.HandleFunc("/recover", func(w http.ResponseWriter, _ *http.Request) {
            failing.Store(false)
            log.Println("switched to HEALTHY mode via HTTP")
            w.Write([]byte("now healthy\n"))
        })
        log.Println("control server on :8080")
        http.ListenAndServe(":8080", mux)
    }()

    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }
    s := grpc.NewServer()
    pb.RegisterGreeterServer(s, &server{})
    grpc_health_v1.RegisterHealthServer(s, healthSrv)
    reflection.Register(s)

    hostname, _ := os.Hostname()
    log.Printf("gRPC server on :50051 (hostname=%s)", hostname)
    s.Serve(lis)
}

Dockerfile

FROM golang:1.22-alpine AS build
WORKDIR /app
COPY go.mod server.go ./
RUN go mod tidy && go build -o server .

FROM alpine:latest
COPY --from=build /app/server /server
ENTRYPOINT ["/server"]

docker-compose.yaml

services:
  primary1:
    build: .

  primary2:
    build: .

  backup:
    build: .

  envoy:
    image: envoyproxy/envoy:v1.31-latest
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml
    ports:
      - "10000:10000"
      - "9901:9901"
    command: envoy -c /etc/envoy/envoy.yaml --log-level info
    depends_on:
      - primary1
      - primary2
      - backup

Envoy config: envoy.yaml

static_resources:
  listeners:
    - name: grpc_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: grpc_proxy
                codec_type: AUTO
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                route_config:
                  name: grpc_route
                  virtual_hosts:
                    - name: grpc_services
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/", grpc: {} }
                          route:
                            cluster: grpc_cluster
                            timeout: 0s
                            retry_policy:  # (1)
                              retry_on: "connect-failure,reset,resource-exhausted,unavailable"
                              num_retries: 3
                              per_try_timeout: 5s
                http_filters:
                  - name: envoy.filters.http.grpc_stats
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_stats.v3.FilterConfig
                      emit_filter_state: true
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: grpc_cluster
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}

      # --- CIRCUIT BREAKER ---  (2)
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 100
            max_pending_requests: 50
            max_requests: 200
            max_retries: 3

      # --- OUTLIER DETECTION ---  (3)
      outlier_detection:
        consecutive_gateway_errors: 5
        consecutive_local_origin_failure: 5
        interval: 10s
        base_ejection_time: 30s
        max_ejection_percent: 50

      # --- ACTIVE HEALTH CHECKING ---
      health_checks:
        - timeout: 2s
          interval: 5s
          unhealthy_threshold: 2
          healthy_threshold: 1
          grpc_health_check: {}

      # --- PRIORITY GROUPS ---  (4)
      load_assignment:
        cluster_name: grpc_cluster
        endpoints:
          - priority: 0  # primaries — receives all traffic when healthy
            lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: primary1, port_value: 50051 }
              - endpoint:
                  address:
                    socket_address: { address: primary2, port_value: 50051 }
          - priority: 1  # backup — only used when ALL priority 0 endpoints are unhealthy
            lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: backup, port_value: 50051 }

admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
  1. Retry policy on the route: retry on connect-failure, stream reset, and gRPC UNAVAILABLE/RESOURCE_EXHAUSTED status codes. per_try_timeout limits each individual attempt, while timeout on the route (set to 0s) disables the total deadline. When timeout: 0s and per_try_timeout: 5s, each retry attempt has up to 5 seconds, but the total call can run indefinitely across all retries.
  2. Circuit breaker limits prevent cascading failure during an overload event. The max_retries: 3 limit is particularly important: it prevents retry storms where retries generate more retries. priority: DEFAULT sets limits for normal traffic; you can also set separate limits for HIGH priority traffic (routes configured with priority: HIGH).
  3. Outlier detection watches real traffic. After 5 consecutive errors from an endpoint, it is ejected for base_ejection_time (30s). The second ejection lasts 60s, the third 90s, etc. At most 50% of the cluster can be ejected at once — so with 2 primaries, at most 1 is ejected at a time.
  4. Priority 0 endpoints receive all traffic normally. Priority 1 only activates when Envoy determines that not enough priority-0 capacity is healthy (below ~71% with the default overprovisioning_factor: 140).

Run it

docker compose up --build

Exercises

1. Verify only primaries serve traffic

for i in $(seq 1 8); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
  | grep message
done

All responses should show primary1 or primary2 container IDs — never backup.

2. Kill both primaries and watch failover

docker compose stop primary1 primary2

Wait 10–15 seconds for health checks to detect failure (unhealthy_threshold: 2 × interval: 5s = 10s). Then send requests:

for i in $(seq 1 6); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
  | grep message
done

Now you should see the backup container ID in all responses.

3. Watch priority state in the admin API

curl -s http://localhost:9901/clusters | grep -E "(priority|health_flags|cx_connect)"

Look for health_flags: endpoints failing active health checks show health_flags::/failed_active_hc.

For a structured view, use the JSON endpoint:

curl -s http://localhost:9901/clusters?format=json | python3 -m json.tool | grep -A5 priority

4. Restore primaries and verify traffic returns

docker compose start primary1 primary2

Wait 10s, then send requests again. Traffic should shift back to the primaries automatically once they pass healthy_threshold: 1 health check.

5. Trigger outlier detection without stopping containers

The backend exposes a control HTTP endpoint. Find the container names:

docker compose ps

Then switch primary1 to failing mode (adjust service name if needed):

# Send the /fail command to primary1's control port
docker compose exec primary1 wget -qO- http://localhost:8080/fail

Now send gRPC requests — primary1 will start returning UNAVAILABLE. After 5 consecutive errors, outlier detection ejects it:

for i in $(seq 1 20); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
  | grep message
done

Check outlier detection stats:

curl -s http://localhost:9901/stats | grep "outlier_detection\|ejected"

Recover the backend:

docker compose exec primary1 wget -qO- http://localhost:8080/recover

6. Observe circuit breaker

Check the overflow counters (they start at zero):

curl -s http://localhost:9901/stats | grep overflow

To trigger circuit breaking intentionally, you can reduce max_requests to a very low value (e.g. 1) and send concurrent requests with a load testing tool like ghz.


Key admin API stats for this step

Stat                                                      Meaning
cluster.grpc_cluster.upstream_rq_retry                    Total retries performed
cluster.grpc_cluster.upstream_rq_retry_overflow           Retries rejected by circuit breaker
cluster.grpc_cluster.upstream_rq_pending_overflow         Requests rejected (pending limit hit)
cluster.grpc_cluster.outlier_detection.ejections_active   Currently ejected endpoints
cluster.grpc_cluster.outlier_detection.ejections_total    All-time ejection count

What you learned

  • Priority groups: priority: 0 for primaries, priority: 1 for backup
  • Active health checks + outlier detection work together
  • Circuit breaker thresholds per priority level
  • Retry policy for transient gRPC errors
  • How to trigger and observe failover through the admin API

Next step

In Step 5 you will move everything to Kubernetes using the Gateway API and dynamic endpoint discovery.