Step 4: Failover, Outlier Detection & Circuit Breaking¶
Duration: ~30 minutes. Goal: configure primary/backup failover, automatic endpoint ejection, circuit breaking, and retry policies.
What you will learn¶
- How priority groups drive primary → backup failover
- The difference between active health checks and outlier detection
- How circuit breakers protect your upstreams
- How retry policies work for transient gRPC errors
- How to observe all of this through the admin API
Concepts¶
Priority groups¶
Every lb_endpoint belongs to an endpoint group with a priority number. Envoy sends traffic to the lowest priority group that has at least one healthy endpoint.
priority 0 (primaries): backend1, backend2 ← traffic goes here normally
priority 1 (backup): backup ← only used if ALL of priority 0 is unhealthy
Traffic only shifts from priority 0 to priority 1 when Envoy considers enough priority-0 endpoints unhealthy. The threshold is controlled by `overprovisioning_factor` (default 140, i.e. 140%): a priority level carries full traffic as long as its healthy share multiplied by 1.4 still reaches 100%, so spillover starts once healthy capacity drops below 1/1.4 ≈ 71%. For a simple 2-node primary cluster, losing 1 of 2 (50% healthy) triggers partial spillover; losing both triggers full failover.
Partial spillover
Envoy can split traffic across priorities. If you have 3 primaries and 1 goes down (67% healthy), priority 0 keeps 67% × 1.4 ≈ 93% of traffic and roughly 7% spills to the backup with the default overprovisioning factor. For a strict "all or nothing" failover, raise `overprovisioning_factor` high enough that any non-empty set of healthy primaries still covers 100% of traffic (for a 2-node primary cluster, at least 200).
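The spillover arithmetic above can be sketched in code. This is a simplified model of Envoy's documented priority-load calculation, not Envoy's actual implementation; `priorityLoads` is a hypothetical helper, and the normalization edge case where every priority is degraded is ignored:

```go
package main

import "fmt"

// priorityLoads models the documented priority-load formula: each priority
// level can absorb min(100, healthyPct * factor/100) percent of traffic,
// and whatever it cannot absorb spills to the next priority level.
func priorityLoads(healthyPct []float64, factor float64) []float64 {
	loads := make([]float64, len(healthyPct))
	remaining := 100.0
	for i, h := range healthyPct {
		capacity := h * factor / 100.0
		if capacity > 100 {
			capacity = 100
		}
		load := capacity
		if load > remaining {
			load = remaining
		}
		loads[i] = load
		remaining -= load
	}
	return loads
}

func main() {
	// 2 primaries with 1 healthy (50%), a fully healthy backup, factor 140:
	// priority 0 keeps 50 * 1.4 = 70%, the remaining 30% spills to the backup.
	fmt.Println(priorityLoads([]float64{50, 100}, 140)) // prints [70 30]
}
```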
How priority failover is configured in YAML¶
The priority number is set directly on the endpoints list entry, not on individual endpoints:
```yaml
load_assignment:
  cluster_name: grpc_cluster
  endpoints:
  - priority: 0            # ← set priority on the locality group
    lb_endpoints:
    - endpoint: ...        # primary 1
    - endpoint: ...        # primary 2
  - priority: 1            # ← separate locality group for backup
    lb_endpoints:
    - endpoint: ...        # backup
```
Each endpoints entry is a LocalityLbEndpoints message. All endpoints within one entry share the same priority and locality metadata.
Outlier detection¶
Outlier detection is passive: Envoy watches real traffic and ejects endpoints that show error patterns. No probes are sent.
Key parameters:
| Parameter | Meaning |
|---|---|
| `consecutive_gateway_errors` | Consecutive 5xx or connection failures before ejection |
| `consecutive_local_origin_failure` | Consecutive local-origin failures (connection refused, reset) |
| `base_ejection_time` | How long the first ejection lasts |
| `max_ejection_percent` | Max % of the cluster that can be ejected simultaneously |
| `interval` | How often to evaluate ejection candidates |
Ejection timeline:
1. Envoy observes `consecutive_gateway_errors` (e.g. 5) consecutive failures from an endpoint.
2. The endpoint is ejected for `base_ejection_time` × (number of times this endpoint has been ejected): the first ejection lasts exactly `base_ejection_time`, the second lasts `2 × base_ejection_time`, and so on. This growing backoff prevents flapping.
3. After the ejection period, the endpoint is re-admitted to the load-balancing pool.
4. If it fails again immediately, the next ejection period is longer still.
max_ejection_percent is a safety valve: Envoy won't eject more than this percentage of the cluster, even if many endpoints are all failing. This prevents Envoy from ejecting so many endpoints that the remaining ones get overloaded.
Outlier detection vs active health checks
Both mechanisms mark endpoints unhealthy, but differently:
- Active health checks mark an endpoint unhealthy only after the health check probe fails `unhealthy_threshold` times. The endpoint is then considered `FAILED_ACTIVE_HC`.
- Outlier detection ejects an endpoint based on real request errors. The endpoint may still be "healthy" by active health check standards, just temporarily ejected from the LB pool.
An endpoint is excluded from traffic if it is unhealthy by either mechanism.
Circuit breaking¶
Circuit breakers limit how much load Envoy will send to a cluster. Unlike ejecting individual endpoints, circuit breaking applies to the entire cluster:
| Setting | What it limits |
|---|---|
| `max_connections` | Open TCP connections to the cluster |
| `max_pending_requests` | Requests queued waiting for a connection |
| `max_requests` | Active concurrent requests |
| `max_retries` | Active concurrent retries |
When a limit is hit, requests are rejected immediately with 503 Service Unavailable. The stat `cluster.<name>.upstream_rq_pending_overflow` counts these rejections.
Retry policy¶
Retries are configured on the route, not the cluster. For gRPC, the relevant retry conditions are:

| Condition | Triggers on |
|---|---|
| `connect-failure` | TCP connection failure to upstream |
| `reset` | Upstream reset the stream (GOAWAY, RST_STREAM) |
| `resource-exhausted` | gRPC `RESOURCE_EXHAUSTED` status |
| `unavailable` | gRPC `UNAVAILABLE` status |
Retries and non-idempotent calls
Only retry when it is safe to re-send the request. Retrying a non-idempotent write can cause duplicates. Use the retriable-headers or retriable-status-codes retry conditions to limit retries to cases you control.
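As a sketch (this fragment is not part of the tutorial's config, and the 503 status and `x-retry-safe` header are illustrative choices), a route-level retry policy limited to responses the upstream explicitly marks retriable might look like:

```yaml
retry_policy:
  # Only retry when the response matches an allow-list we control.
  retry_on: "retriable-status-codes,retriable-headers"
  retriable_status_codes: [503]
  # The upstream can opt a response into retries with this header.
  retriable_headers:
  - name: "x-retry-safe"
    string_match:
      exact: "true"
  num_retries: 2
```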
Setup¶
Backend: server.go¶
This backend supports a `FAIL_ON_START` env var and an HTTP control endpoint that can switch it into a failing mode at runtime — useful for triggering outlier detection without stopping containers.
```go
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"os"
	"sync/atomic"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/health"
	"google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/reflection"
	"google.golang.org/grpc/status"

	pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

var failing atomic.Bool

type server struct{ pb.UnimplementedGreeterServer }

func (s *server) SayHello(_ context.Context, req *pb.HelloRequest) (*pb.HelloReply, error) {
	if failing.Load() {
		return nil, status.Error(codes.Unavailable, "server is in failing mode")
	}
	hostname, _ := os.Hostname()
	return &pb.HelloReply{
		Message: "Hello " + req.Name + " from " + hostname,
	}, nil
}

func main() {
	if os.Getenv("FAIL_ON_START") == "true" {
		failing.Store(true)
		log.Println("starting in FAILING mode")
	}

	healthSrv := health.NewServer()

	// Sync health status with the failing flag.
	go func() {
		for range time.Tick(500 * time.Millisecond) {
			if failing.Load() {
				healthSrv.SetServingStatus("", grpc_health_v1.HealthCheckResponse_NOT_SERVING)
			} else {
				healthSrv.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)
			}
		}
	}()

	// HTTP control endpoint: POST /fail and POST /recover.
	go func() {
		mux := http.NewServeMux()
		mux.HandleFunc("/fail", func(w http.ResponseWriter, _ *http.Request) {
			failing.Store(true)
			log.Println("switched to FAILING mode via HTTP")
			w.Write([]byte("now failing\n"))
		})
		mux.HandleFunc("/recover", func(w http.ResponseWriter, _ *http.Request) {
			failing.Store(false)
			log.Println("switched to HEALTHY mode via HTTP")
			w.Write([]byte("now healthy\n"))
		})
		log.Println("control server on :8080")
		http.ListenAndServe(":8080", mux)
	}()

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	s := grpc.NewServer()
	pb.RegisterGreeterServer(s, &server{})
	grpc_health_v1.RegisterHealthServer(s, healthSrv)
	reflection.Register(s)

	hostname, _ := os.Hostname()
	log.Printf("gRPC server on :50051 (hostname=%s)", hostname)
	s.Serve(lis)
}
```
Dockerfile¶
```dockerfile
FROM golang:1.22-alpine AS build
WORKDIR /app
COPY go.mod server.go ./
RUN go mod download && go build -o server .

FROM alpine:latest
COPY --from=build /app/server /server
ENTRYPOINT ["/server"]
```
docker-compose.yaml¶
```yaml
services:
  primary1:
    build: .
  primary2:
    build: .
  backup:
    build: .
  envoy:
    image: envoyproxy/envoy:v1.31-latest
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml
    ports:
      - "10000:10000"
      - "9901:9901"
    command: envoy -c /etc/envoy/envoy.yaml --log-level info
    depends_on:
      - primary1
      - primary2
      - backup
```
Envoy config: envoy.yaml¶
```yaml
static_resources:
  listeners:
  - name: grpc_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: grpc_proxy
          codec_type: AUTO
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
          route_config:
            name: grpc_route
            virtual_hosts:
            - name: grpc_services
              domains: ["*"]
              routes:
              - match: { prefix: "/", grpc: {} }
                route:
                  cluster: grpc_cluster
                  timeout: 0s
                  retry_policy:  # (1)
                    retry_on: "connect-failure,reset,resource-exhausted,unavailable"
                    num_retries: 3
                    per_try_timeout: 5s
          http_filters:
          - name: envoy.filters.http.grpc_stats
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_stats.v3.FilterConfig
              emit_filter_state: true
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: grpc_cluster
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    # --- CIRCUIT BREAKER --- (2)
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 100
        max_pending_requests: 50
        max_requests: 200
        max_retries: 3
    # --- OUTLIER DETECTION --- (3)
    outlier_detection:
      consecutive_gateway_errors: 5
      consecutive_local_origin_failure: 5
      interval: 10s
      base_ejection_time: 30s
      max_ejection_percent: 50
    # --- ACTIVE HEALTH CHECKING ---
    health_checks:
    - timeout: 2s
      interval: 5s
      unhealthy_threshold: 2
      healthy_threshold: 1
      grpc_health_check: {}
    # --- PRIORITY GROUPS --- (4)
    load_assignment:
      cluster_name: grpc_cluster
      endpoints:
      - priority: 0  # primaries — receives all traffic when healthy
        lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: primary1, port_value: 50051 }
        - endpoint:
            address:
              socket_address: { address: primary2, port_value: 50051 }
      - priority: 1  # backup — only used when ALL priority 0 endpoints are unhealthy
        lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: backup, port_value: 50051 }
admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
```
1. Retry policy on the route: retry on `connect-failure`, stream reset, and gRPC `UNAVAILABLE`/`RESOURCE_EXHAUSTED` status codes. `per_try_timeout` limits each individual attempt, while `timeout` on the route (set to `0s`) disables the total deadline. With `timeout: 0s` and `per_try_timeout: 5s`, each retry attempt gets up to 5 seconds, but the total call can run indefinitely across all retries.
2. Circuit breaker limits prevent cascading failure during an overload event. The `max_retries: 3` limit is particularly important: it prevents retry storms where retries generate more retries. `priority: DEFAULT` sets limits for normal traffic; you can also set separate limits for `HIGH` priority traffic.
3. Outlier detection watches real traffic. After 5 consecutive errors from an endpoint, it is ejected for `base_ejection_time` (30s). The second ejection lasts 60s, the third 90s, and so on. At most 50% of the cluster can be ejected at once, so with 2 primaries at most 1 is ejected at a time.
4. Priority 0 endpoints receive all traffic normally. Priority 1 only activates when Envoy determines that not enough priority-0 capacity is healthy (below ~71% with the default `overprovisioning_factor: 140`).
Run it¶
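The source gives no commands for this step; assuming all the files above sit in one directory, a typical sequence is:

```shell
# Build the backend image and start all four containers in the background
docker compose up --build -d

# Confirm everything is running
docker compose ps
```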
Exercises¶
1. Verify only primaries serve traffic¶
```shell
for i in $(seq 1 8); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
    | grep message
done
```
All responses should show primary1 or primary2 container IDs — never backup.
2. Kill both primaries and watch failover¶
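The stop command itself is not shown in the source; assuming the docker-compose service names above:

```shell
# Stop both priority-0 backends; only the backup remains
docker compose stop primary1 primary2
```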
Wait 10–15 seconds for health checks to detect failure (unhealthy_threshold: 2 × interval: 5s = 10s). Then send requests:
```shell
for i in $(seq 1 6); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
    | grep message
done
```
Now you should see the backup container ID in all responses.
3. Watch priority state in the admin API¶
Look for `health_flags` — unhealthy endpoints show `::failed_active_hc`.
For a structured view, use the JSON endpoint:
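The admin commands themselves are not included in the source; a plausible pair against the admin port (9901, per docker-compose.yaml) follows. The `jq` filter is an assumption for readability:

```shell
# Plain-text endpoint status; ejected/unhealthy endpoints carry health_flags
curl -s localhost:9901/clusters | grep health_flags

# Structured JSON view of the same data
curl -s 'localhost:9901/clusters?format=json' | jq '.cluster_statuses[]'
```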
4. Restore primaries and verify traffic returns¶
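Assuming the primaries were stopped with `docker compose stop` in exercise 2, bring them back:

```shell
docker compose start primary1 primary2
```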
Wait 10s, then send requests again. Traffic should shift back to the primaries automatically once they pass a single successful health check (`healthy_threshold: 1`).
5. Trigger outlier detection without stopping containers¶
The backend exposes a control HTTP endpoint. Find the container names:
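The listing command is not given in the source; with the compose file above, a likely choice is:

```shell
docker compose ps
```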
Then switch primary1 to failing mode (adjust service name if needed):
```shell
# Send the /fail command to primary1's control port
docker compose exec primary1 wget -qO- http://localhost:8080/fail
```
Now send gRPC requests — primary1 will start returning UNAVAILABLE. After 5 consecutive errors, outlier detection ejects it:
```shell
for i in $(seq 1 20); do
  grpcurl -plaintext -d '{"name": "test"}' \
    localhost:10000 helloworld.Greeter/SayHello 2>/dev/null \
    | grep message
done
```
Check outlier detection stats:
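A plausible check via the admin API (the stat names follow the `cluster.<name>.outlier_detection` prefix from the table below):

```shell
curl -s localhost:9901/stats | grep grpc_cluster.outlier_detection
```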
Recover the backend:
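Mirroring the `/fail` call above:

```shell
docker compose exec primary1 wget -qO- http://localhost:8080/recover
```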
6. Observe circuit breaker¶
Check the overflow counters (they start at zero):
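One way to read them from the admin API:

```shell
curl -s localhost:9901/stats | grep grpc_cluster | grep overflow
```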
To trigger circuit breaking intentionally, you can reduce max_requests to a very low value (e.g. 1) and send concurrent requests with a load testing tool like ghz.
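A sketch of such a load test, assuming `ghz` is installed locally (no `.proto` file is needed because the backend registers gRPC server reflection):

```shell
# 50 concurrent callers, 500 total requests; with max_requests lowered to 1,
# most attempts should be rejected by the circuit breaker
ghz --insecure -d '{"name": "test"}' \
  --call helloworld.Greeter/SayHello \
  -c 50 -n 500 localhost:10000
```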
Key admin API stats for this step¶
| Stat | Meaning |
|---|---|
| `cluster.grpc_cluster.upstream_rq_retry` | Total retries performed |
| `cluster.grpc_cluster.upstream_rq_retry_overflow` | Retries rejected by circuit breaker |
| `cluster.grpc_cluster.upstream_rq_pending_overflow` | Requests rejected (pending limit hit) |
| `cluster.grpc_cluster.outlier_detection.ejections_active` | Currently ejected endpoints |
| `cluster.grpc_cluster.outlier_detection.ejections_total` | All-time ejection count |
What you learned¶
- Priority groups: `priority: 0` for primaries, `priority: 1` for backup
- Active health checks and outlier detection work together
- Circuit breaker thresholds per priority level
- Retry policy for transient gRPC errors
- How to trigger and observe failover through the admin API
Next step
In Step 5 you will move everything to Kubernetes using the Gateway API and dynamic endpoint discovery.