Overview
The Supervisor/Restart pattern involves a supervisor goroutine that monitors one or more worker goroutines. If a worker fails (panics or exits unexpectedly), the supervisor restarts it. This pattern is essential for building resilient and fault-tolerant systems, automatically recovering from transient errors, and ensuring critical tasks are always running.
NOTE: For other posts on concurrency patterns, check out the index post for this series.
Implementation Details
Structure
The supervisor/restart implementation in examples/supervisor.go consists of three main components:
- Supervisor – A goroutine that monitors and manages workers
- Worker – A goroutine that performs work and may fail
- Coordination – Channels and synchronization for monitoring and restart
Code Analysis
Let’s break down the main function and understand how each component works:
func RunSupervisor() {
	var restarts int32
	stop := make(chan struct{})
	done := make(chan struct{})

	// Supervisor goroutine
	go func() {
		for {
			workerDone := make(chan struct{})
			go workerWithFailure(workerDone, stop)

			select {
			case <-workerDone:
				atomic.AddInt32(&restarts, 1)
				fmt.Println("Supervisor: Worker failed, restarting...")
				// Restart after a short delay
				time.Sleep(500 * time.Millisecond)
			case <-stop:
				fmt.Println("Supervisor: Stopping worker supervision.")
				close(done)
				return
			}
		}
	}()

	// Let the supervisor run for a while
	time.Sleep(4 * time.Second)
	close(stop)
	<-done

	fmt.Printf("Supervisor example completed! Worker was restarted %d times.\n", restarts-1)
}
Step-by-step breakdown:
- State and Channel Setup:
  - var restarts int32 creates an atomic counter to track restart attempts
  - stop := make(chan struct{}) creates a signal channel for graceful shutdown
  - done := make(chan struct{}) creates a completion channel for supervisor coordination
  - These channels use struct{} as the element type for efficiency: the value is zero-sized, so the signals carry no data
- Supervisor Goroutine Launch:
  - The supervisor runs in its own goroutine so it can monitor workers continuously
  - An infinite for loop restarts workers as needed
  - This ensures the supervisor is always ready to handle worker failures
- Worker Lifecycle Management:
  - workerDone := make(chan struct{}) creates a fresh channel for each worker
  - Each worker gets its own done channel to signal completion or failure
  - go workerWithFailure(workerDone, stop) launches a new worker goroutine
  - The worker receives both the done channel (to signal back) and the stop channel (for shutdown)
- Supervisor Monitoring Loop:
  - A select statement waits for either worker completion or a stop signal
  - case <-workerDone: handles worker failure or completion
  - case <-stop: handles a graceful shutdown request
  - The supervisor reacts to whichever of these events arrives first
- Failure Handling and Restart Logic:
  - atomic.AddInt32(&restarts, 1) safely increments the restart counter
  - The atomic operation gives thread safety without mutex overhead
  - time.Sleep(500 * time.Millisecond) provides a restart delay
  - The delay prevents a "thundering herd": rapid restart loops that could overwhelm the system
- Graceful Shutdown Process:
  - close(done) signals that the supervisor has finished its work
  - return exits the supervisor goroutine
  - This ensures the supervisor goroutine terminates cleanly instead of leaking
- Main Function Coordination:
  - time.Sleep(4 * time.Second) lets the supervisor run for a demonstration period
  - close(stop) signals the supervisor to stop monitoring
  - <-done waits for the supervisor to finish its shutdown process
  - restarts-1 accounts for the initial worker launch, which is not a restart
Worker Implementation
func workerWithFailure(done chan<- struct{}, stop <-chan struct{}) {
	fmt.Println("Worker: Started")
	workTime := time.Duration(rand.Intn(1200)+400) * time.Millisecond

	select {
	case <-time.After(workTime):
		// Simulate random failure
		if rand.Float32() < 0.6 {
			fmt.Println("Worker: Simulated failure!")
			done <- struct{}{}
			return
		}
		fmt.Println("Worker: Completed work successfully.")
	case <-stop:
		fmt.Println("Worker: Received stop signal.")
	}

	// Signal normal exit
	done <- struct{}{}
}
Worker function breakdown:
- Function Signature:
  - done chan<- struct{}: send-only channel for signaling completion or failure to the supervisor
  - stop <-chan struct{}: receive-only channel for receiving shutdown signals
  - Directional channel types provide encapsulation and prevent misuse
- Worker Initialization:
  - Prints a startup message for debugging and monitoring
  - workTime := time.Duration(rand.Intn(1200)+400) * time.Millisecond creates a variable work duration
  - The random duration between 400 and 1600 ms simulates real-world processing variability
- Work Execution with Timeout:
  - A select statement handles both work completion and stop signals
  - case <-time.After(workTime): simulates work that takes a variable amount of time
  - case <-stop: handles graceful shutdown requests
- Failure Simulation Logic:
  - if rand.Float32() < 0.6 creates a 60% probability of failure
  - The high failure rate demonstrates the restart mechanism effectively
  - done <- struct{}{} immediately signals the failure to the supervisor
  - return exits the worker without sending a second done signal
- Successful Completion Path:
  - If no failure occurs, the worker prints a success message
  - Execution continues to the final done <- struct{}{} to signal normal completion
  - Only the log output distinguishes success from failure; the supervisor receives the same empty-struct signal in both cases
- Stop Signal Handling:
  - If the stop signal is received, the worker prints an acknowledgment
  - Execution continues to the final done <- struct{}{} to signal graceful shutdown
  - This ensures the supervisor knows the worker has stopped
- Final Signal:
  - done <- struct{}{} signals completion for both the success and graceful-shutdown cases
  - Ensures the supervisor always receives a signal when the worker exits
  - Prevents the supervisor from waiting indefinitely
Key Design Patterns:
- Channel Direction Safety: Using directional channels (chan<- and <-chan) prevents accidental misuse and provides clear API contracts.
- Atomic Counter: Using atomic.AddInt32 for restart counting provides thread safety without mutex overhead.
- Select Statement: The select statement in both supervisor and worker lets each goroutine react to whichever of several channel events happens first.
- Fresh Channel per Worker: Creating a new workerDone channel for each worker ensures clean communication and prevents signal confusion.
- Graceful Shutdown: Both supervisor and worker respond to stop signals, ensuring clean resource cleanup.
- Failure Simulation: A high failure rate (60%) combined with variable work time demonstrates the restart mechanism effectively within a short demonstration period.
How It Works
- Supervisor Initialization: The supervisor starts and launches the first worker
- Worker Execution: The worker performs its task and may fail randomly
- Failure Detection: When a worker fails, it signals the supervisor via a channel
- Restart Logic: The supervisor waits briefly, then launches a new worker
- Continuous Monitoring: This cycle continues until the supervisor receives a stop signal
- Graceful Shutdown: The supervisor stops launching new workers and signals completion; the current worker also observes the stop channel and exits
The pattern ensures that critical work continues even when individual workers fail.
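The demo worker only simulates failure with a random check; an actual panic would crash the whole program because nothing recovers it. Below is a minimal sketch of how the worker launch could be wrapped so a panic is converted into the same done signal the supervisor already understands. The superviseWithRecover name and the fn parameter are illustrative, not part of the example code, and the sketch assumes it lives in the same examples package with fmt imported.

// superviseWithRecover runs fn in a goroutine, converts a panic into a
// normal exit, and always notifies the supervisor on done (sketch).
func superviseWithRecover(done chan<- struct{}, stop <-chan struct{}, fn func(<-chan struct{}) error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("Worker: recovered from panic:", r)
			}
			done <- struct{}{} // the supervisor sees the same signal either way
		}()
		if err := fn(stop); err != nil {
			fmt.Println("Worker: exited with error:", err)
		}
	}()
}

The supervisor loop would then launch workers via superviseWithRecover(workerDone, stop, task) instead of go workerWithFailure(workerDone, stop), and a panicking task would simply trigger another restart.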
Why This Implementation?
Channel-based Communication
- Non-blocking Monitoring: Supervisor can monitor multiple workers without blocking
- Immediate Failure Detection: Workers can signal failure immediately
- Clean Coordination: Channels provide natural synchronization
Atomic Counter for Restarts
- Thread Safety: Atomic operations ensure accurate restart counting
- Performance: Atomic operations are more efficient than mutex-based counting
- Simplicity: No need for complex synchronization around the counter
Separate Stop Channel
- Graceful Shutdown: Supervisor can stop monitoring without waiting for failures
- Clean Termination: Workers receive stop signals and can clean up properly
- Resource Management: Prevents infinite restart loops
Worker Done Channel
- Failure Signaling: Workers signal completion or failure to supervisor
- Immediate Response: Supervisor can react immediately to worker state changes
- Flexible Communication: The channel's element type can be enriched (for example, an error value) so the supervisor can distinguish normal completion from failure, as sketched below
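Because the demo signals with an empty struct, the supervisor only learns that the worker exited, not why. Here is a minimal sketch of a done channel that carries an error instead; workerWithResult and the result channel name are illustrative assumptions, and the failure probability mirrors the demo.

// workerWithResult reports why it exited: a non-nil error means failure (sketch).
func workerWithResult(result chan<- error, stop <-chan struct{}) {
	select {
	case <-time.After(800 * time.Millisecond): // stand-in for real work
		if rand.Float32() < 0.6 {
			result <- fmt.Errorf("simulated failure") // failure path
			return
		}
		result <- nil // success path
	case <-stop:
		result <- nil // graceful shutdown is not a failure
	}
}

On the supervisor side, case err := <-result: could then increment the restart counter only when err is non-nil.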
Restart Delay
- Prevent Thundering Herd: Brief delay prevents rapid restart loops
- Resource Recovery: Gives system time to recover from transient issues
- Stability: Prevents overwhelming the system with restart attempts
Key Design Decisions
- Simulated Failures: Random failures (60% probability) demonstrate the restart mechanism
- Variable Work Time: Random work duration simulates real-world variability
- Prompt Restart: The supervisor launches a replacement worker after a short fixed delay (500 ms) once a failure is detected
- Stop Signal Handling: Both supervisor and workers respond to stop signals
- Restart Counting: Tracks the number of restarts for monitoring purposes
Failure Handling Strategies
Immediate Restart
- Pros: Minimal downtime, immediate recovery
- Cons: May restart into the same failure condition
- Use Case: Transient failures, network issues
Exponential Backoff
- Pros: Prevents overwhelming the system, allows for recovery
- Cons: Longer downtime for legitimate failures
- Use Case: Persistent failures, resource exhaustion
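A minimal sketch of the supervisor goroutine's loop with exponential backoff in place of the fixed 500 ms delay; the starting delay, cap, and doubling factor are arbitrary illustrative choices.

	// Supervisor loop with exponential backoff between restarts (sketch).
	backoff := 500 * time.Millisecond
	const maxBackoff = 10 * time.Second
	for {
		workerDone := make(chan struct{})
		go workerWithFailure(workerDone, stop)

		select {
		case <-workerDone:
			fmt.Printf("Supervisor: worker exited, restarting in %v\n", backoff)
			time.Sleep(backoff)
			backoff *= 2 // double the delay after every exit
			if backoff > maxBackoff {
				backoff = maxBackoff // cap the delay
			}
			// In a real system the delay would be reset after a period
			// of healthy operation or a successful completion.
		case <-stop:
			close(done)
			return
		}
	}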
Circuit Breaker
- Pros: Prevents cascading failures, allows system recovery
- Cons: More complex implementation
- Use Case: External service failures, resource constraints
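A rough sketch of a circuit-breaker-style guard on the same loop: after a threshold of consecutive worker exits, the supervisor pauses restarts for a cool-down period before trying again. The threshold and cool-down values are illustrative.

	// Circuit-breaker-style guard around the restart loop (sketch).
	failures := 0
	const failureThreshold = 5
	const coolDown = 30 * time.Second
	for {
		workerDone := make(chan struct{})
		go workerWithFailure(workerDone, stop)

		select {
		case <-workerDone:
			failures++
			if failures >= failureThreshold {
				fmt.Println("Supervisor: too many failures, pausing restarts")
				select {
				case <-time.After(coolDown):
					failures = 0 // half-open: allow restarts again
				case <-stop:
					close(done)
					return
				}
			}
		case <-stop:
			close(done)
			return
		}
	}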
Health Checks
- Pros: Proactive failure detection, better monitoring
- Cons: Additional complexity, false positives
- Use Case: Long-running processes, critical systems
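A minimal sketch of heartbeat-based health checking, where the worker pings the supervisor periodically and is restarted if a heartbeat is missed. heartbeatWorker, superviseByHeartbeat, and the intervals are illustrative assumptions, not part of the example code.

// heartbeatWorker does its work and pings the supervisor every second (sketch).
func heartbeatWorker(heartbeat chan<- struct{}, stop <-chan struct{}) {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			select {
			case heartbeat <- struct{}{}: // report liveness
			default: // never block the worker on a slow supervisor
			}
		case <-stop:
			return
		}
	}
}

// superviseByHeartbeat restarts the worker if no heartbeat arrives in time (sketch).
func superviseByHeartbeat(stop <-chan struct{}) {
	heartbeat := make(chan struct{}, 1)
	go heartbeatWorker(heartbeat, stop)
	for {
		select {
		case <-heartbeat:
			// Worker is alive; keep waiting.
		case <-time.After(3 * time.Second):
			fmt.Println("Supervisor: missed heartbeat, restarting worker")
			go heartbeatWorker(heartbeat, stop)
		case <-stop:
			return
		}
	}
}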
Common Use Cases
System Services
- Background Workers: Ensure critical background tasks continue running
- Data Processing: Restart failed data processing jobs
- Scheduled Tasks: Ensure scheduled operations complete successfully
Microservices
- Service Health: Monitor and restart unhealthy service instances
- API Endpoints: Ensure critical API endpoints remain available
- Message Processors: Restart failed message processing workers
Monitoring Systems
- Health Monitors: Ensure monitoring agents continue running
- Alert Processors: Restart failed alert processing workers
- Metric Collectors: Ensure metric collection continues
Database Operations
- Connection Managers: Restart failed database connection pools
- Migration Workers: Ensure database migrations complete
- Backup Jobs: Restart failed backup operations
File Processing
- Upload Handlers: Restart failed file upload processors
- Download Workers: Ensure file downloads complete
- Processing Jobs: Restart failed file processing tasks
Network Services
- Connection Handlers: Restart failed network connection handlers
- Protocol Processors: Ensure protocol processing continues
- Load Balancers: Restart failed load balancing workers
IoT Applications
- Device Managers: Restart failed device communication workers
- Sensor Processors: Ensure sensor data processing continues
- Control Systems: Restart failed control system workers
Financial Systems
- Trading Engines: Restart failed trading algorithm workers
- Risk Calculators: Ensure risk calculations continue
- Settlement Processors: Restart failed settlement workers
Best Practices
Failure Detection
- Immediate Signaling: Workers should signal failure immediately
- Graceful Degradation: Workers should handle partial failures gracefully
- Health Reporting: Workers should report their health status
Restart Strategy
- Appropriate Delays: Use delays to prevent restart storms
- Backoff Policies: Implement exponential backoff for persistent failures
- Maximum Restarts: Limit the number of restarts to prevent infinite loops
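A minimal sketch of a restart budget applied to the demo's supervisor loop; the limit of 10 is an arbitrary illustration.

	// Restart budget (sketch): give up after a fixed number of restarts
	// instead of looping forever.
	const maxRestarts = 10
	for attempt := 0; ; attempt++ {
		workerDone := make(chan struct{})
		go workerWithFailure(workerDone, stop)

		select {
		case <-workerDone:
			if attempt >= maxRestarts {
				fmt.Println("Supervisor: restart limit reached, giving up")
				close(done)
				return
			}
		case <-stop:
			close(done)
			return
		}
	}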
Resource Management
- Cleanup: Ensure workers clean up resources before exiting
- Memory Management: Monitor memory usage to prevent leaks
- Connection Management: Properly close connections and file handles
Monitoring and Logging
- Restart Tracking: Log restart events for monitoring
- Performance Metrics: Track worker performance and restart frequency
- Alerting: Alert on excessive restart rates
The supervisor/restart pattern is particularly effective when you have:
- Critical Processes: Tasks that must continue running
- Unreliable Workers: Workers that may fail due to external factors
- Long-running Systems: Systems that need to operate continuously
- Fault Tolerance: Requirements for high availability and reliability
- Automatic Recovery: Need for self-healing systems
This pattern provides a robust foundation for building resilient systems that can automatically recover from failures and maintain continuous operation.
The final example implementation looks like this:
package examples

import (
	"fmt"
	"math/rand"
	"sync/atomic"
	"time"
)

// RunSupervisor demonstrates the supervisor/restart pattern.
func RunSupervisor() {
	fmt.Println("=== Supervisor/Restart Pattern Example ===")

	var restarts int32
	stop := make(chan struct{})
	done := make(chan struct{})

	// Supervisor goroutine
	go func() {
		for {
			workerDone := make(chan struct{})
			go workerWithFailure(workerDone, stop)

			select {
			case <-workerDone:
				atomic.AddInt32(&restarts, 1)
				fmt.Println("Supervisor: Worker failed, restarting...")
				// Restart after a short delay
				time.Sleep(500 * time.Millisecond)
			case <-stop:
				fmt.Println("Supervisor: Stopping worker supervision.")
				close(done)
				return
			}
		}
	}()

	// Let the supervisor run for a while
	time.Sleep(4 * time.Second)
	close(stop)
	<-done

	fmt.Printf("Supervisor example completed! Worker was restarted %d times.\n", restarts-1)
}

// workerWithFailure simulates a worker that randomly fails
func workerWithFailure(done chan<- struct{}, stop <-chan struct{}) {
	fmt.Println("Worker: Started")
	workTime := time.Duration(rand.Intn(1200)+400) * time.Millisecond

	select {
	case <-time.After(workTime):
		// Simulate random failure
		if rand.Float32() < 0.6 {
			fmt.Println("Worker: Simulated failure!")
			done <- struct{}{}
			return
		}
		fmt.Println("Worker: Completed work successfully.")
	case <-stop:
		fmt.Println("Worker: Received stop signal.")
	}

	// Signal normal exit
	done <- struct{}{}
}
To run this example and build the code yourself, check out this and other examples in the go-fluency-concurrency-model-patterns repo. That's it for this topic; tomorrow I'll post on the pub/sub pattern.