Go Concurrency Patterns (Supervisor/Restart Pattern)

Overview

The Supervisor/Restart pattern involves a supervisor goroutine that monitors one or more worker goroutines. If a worker fails (panics or exits unexpectedly), the supervisor restarts it. This pattern is essential for building resilient and fault-tolerant systems, automatically recovering from transient errors, and ensuring critical tasks are always running.

NOTE: For other posts on concurrency patterns, check out the index post for this series.

Implementation Details

Structure

The supervisor/restart implementation in examples/supervisor.go consists of three main components:

  1. Supervisor – A goroutine that monitors and manages workers
  2. Worker – A goroutine that performs work and may fail
  3. Coordination – Channels and synchronization for monitoring and restart

Code Analysis

Let’s break down the RunSupervisor function and understand how each component works:

func RunSupervisor() {
    var restarts int32
    stop := make(chan struct{})
    done := make(chan struct{})

    // Supervisor goroutine
    go func() {
        for {
            workerDone := make(chan struct{}, 1) // buffered so the worker's final send never blocks after shutdown
            go workerWithFailure(workerDone, stop)
            select {
            case <-workerDone:
                atomic.AddInt32(&restarts, 1)
                fmt.Println("Supervisor: Worker failed, restarting...")
                // Restart after a short delay
                time.Sleep(500 * time.Millisecond)
            case <-stop:
                fmt.Println("Supervisor: Stopping worker supervision.")
                close(done)
                return
            }
        }
    }()

    // Let the supervisor run for a while
    time.Sleep(4 * time.Second)
    close(stop)
    <-done

    fmt.Printf("Supervisor example completed! Worker was restarted %d times.\n", restarts-1)
}

Step-by-step breakdown:

  1. State and Channel Setup:
    • var restarts int32 creates an atomic counter to track restart attempts
    • stop := make(chan struct{}) creates a signal channel for graceful shutdown
    • done := make(chan struct{}) creates a completion channel for supervisor coordination
    • These channels use the empty struct{} element type because it occupies zero bytes; they carry pure signals, no data. Closing such a channel acts as a broadcast to every goroutine receiving from it (see the sketch after this list)
  2. Supervisor Goroutine Launch:
    • The supervisor runs in its own goroutine to monitor workers continuously
    • Uses an infinite for loop to continuously restart workers as needed
    • This ensures the supervisor is always ready to handle worker failures
  3. Worker Lifecycle Management:
    • workerDone := make(chan struct{}, 1) creates a fresh, buffered channel for each worker; the one-slot buffer means the worker’s final send never blocks, even if the supervisor has already shut down
    • Each worker gets its own done channel to signal completion or failure
    • go workerWithFailure(workerDone, stop) launches a new worker goroutine
    • The worker receives both the done channel (to signal back) and stop channel (for shutdown)
  4. Supervisor Monitoring Loop:
    • Uses select statement to wait for either worker completion or stop signal
    • case <-workerDone: handles worker failure or completion
    • case <-stop: handles graceful shutdown request
    • select blocks until one of its cases is ready, so the supervisor reacts to whichever event happens first
  5. Failure Handling and Restart Logic:
    • atomic.AddInt32(&restarts, 1) safely increments the restart counter
    • Atomic operation ensures thread safety without mutex overhead
    • time.Sleep(500 * time.Millisecond) provides a restart delay
    • Delay prevents “thundering herd” – rapid restart loops that could overwhelm the system
  6. Graceful Shutdown Process:
    • close(done) signals that the supervisor has finished its work
    • return exits the supervisor goroutine
    • This ensures clean termination without goroutine leaks
  7. Main Function Coordination:
    • time.Sleep(4 * time.Second) lets the supervisor run for a demonstration period
    • close(stop) signals the supervisor to stop monitoring
    • <-done waits for the supervisor to finish its shutdown process
    • restarts-1 accounts for the initial worker launch (not a restart)
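
The shutdown relies on a property of closed channels: a receive from a closed channel returns immediately, so a single close(stop) is observed by every goroutine selecting on it. Below is a tiny standalone sketch of that behaviour; it is not part of the example code, the function name is illustrative, and it assumes the fmt and sync packages:

// demoCloseBroadcast shows that one close unblocks every receiver.
func demoCloseBroadcast() {
    stop := make(chan struct{})
    var wg sync.WaitGroup
    for i := 1; i <= 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            <-stop // blocks until stop is closed
            fmt.Printf("goroutine %d: received stop\n", id)
        }(i)
    }
    close(stop) // one close wakes every receiver
    wg.Wait()
}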

Worker Implementation

func workerWithFailure(done chan<- struct{}, stop <-chan struct{}) {
    fmt.Println("Worker: Started")
    workTime := time.Duration(rand.Intn(1200)+400) * time.Millisecond
    select {
    case <-time.After(workTime):
        // Simulate random failure
        if rand.Float32() < 0.6 {
            fmt.Println("Worker: Simulated failure!")
            done <- struct{}{}
            return
        }
        fmt.Println("Worker: Completed work successfully.")
    case <-stop:
        fmt.Println("Worker: Received stop signal.")
    }
    // Signal exit (success or graceful stop)
    done <- struct{}{}
}

Worker function breakdown:

  1. Function Signature:
    • done chan<- struct{}: Write-only channel for signaling completion/failure to supervisor
    • stop <-chan struct{}: Read-only channel for receiving shutdown signals
    • Directional channels provide encapsulation and prevent misuse
  2. Worker Initialization:
    • Prints startup message for debugging and monitoring
    • workTime := time.Duration(rand.Intn(1200)+400) * time.Millisecond creates variable work duration
    • Random duration between 400-1600ms simulates real-world processing variability
  3. Work Execution with Timeout:
    • Uses select statement to handle both work completion and stop signals
    • case <-time.After(workTime): simulates work that takes a variable amount of time
    • case <-stop: handles graceful shutdown requests
  4. Failure Simulation Logic:
    • if rand.Float32() < 0.6 creates a 60% probability of failure
    • The high failure rate (60%) keeps the demo short while still exercising the restart mechanism; real workers more often fail by panicking, and a panic-recovery variant is sketched after this list
    • done <- struct{}{} immediately signals failure to supervisor
    • return exits the worker without sending a second done signal
  5. Successful Completion Path:
    • If no failure occurs, worker prints success message
    • Continues to the final done <- struct{}{} to signal normal completion
    • The distinction shows up only in the log output; the supervisor receives the same done signal in both cases
  6. Stop Signal Handling:
    • If stop signal is received, worker prints acknowledgment
    • Continues to final done <- struct{}{} to signal graceful shutdown
    • This ensures supervisor knows the worker has stopped
  7. Final Signal:
    • done <- struct{}{} signals completion for both success and graceful shutdown cases
    • Ensures supervisor always receives a signal when worker exits
    • Prevents supervisor from waiting indefinitely
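
The demo triggers failures with a random check, but as the overview notes, real workers often fail by panicking, and a panicking worker would never reach its done send, leaving the supervisor waiting forever. Below is a minimal sketch of one way to cover that case; the function name and its simulated body are hypothetical, and it assumes the same fmt, math/rand, and time imports as the example:

// supervisedWorker wraps its work in a deferred recover so the supervisor is
// always signalled on done, even when the work panics.
func supervisedWorker(done chan<- struct{}, stop <-chan struct{}) {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("Worker: recovered from panic:", r)
        }
        done <- struct{}{} // always signal the supervisor, on success or panic
    }()

    select {
    case <-time.After(time.Duration(rand.Intn(800)+200) * time.Millisecond):
        if rand.Float32() < 0.5 {
            panic("simulated worker panic") // hypothetical failure mode
        }
        fmt.Println("Worker: finished without panicking")
    case <-stop:
        fmt.Println("Worker: received stop signal")
    }
}

The supervisor loop shown earlier works unchanged with such a worker, because it only observes the done channel.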

Key Design Patterns:

  1. Channel Direction Safety: Using directional channels (chan<- and <-chan) prevents accidental misuse and provides clear API contracts.
  2. Atomic Counter: Using atomic.AddInt32 for restart counting provides thread safety without mutex overhead.
  3. Select Statement: The select statement in both supervisor and worker lets each goroutine wait on several channels at once and act on whichever becomes ready first.
  4. Fresh Channel per Worker: Creating a new workerDone channel for each worker ensures clean communication and prevents signal confusion.
  5. Graceful Shutdown: Both supervisor and worker respond to stop signals, ensuring clean resource cleanup.
  6. Failure Simulation: High failure rate (60%) with variable work time demonstrates the restart mechanism effectively in a short demonstration period.

How It Works

  1. Supervisor Initialization: The supervisor starts and launches the first worker
  2. Worker Execution: The worker performs its task and may fail randomly
  3. Failure Detection: When a worker fails, it signals the supervisor via a channel
  4. Restart Logic: The supervisor waits briefly, then launches a new worker
  5. Continuous Monitoring: This cycle continues until the supervisor receives a stop signal
  6. Graceful Shutdown: The supervisor stops launching new workers; the still-running worker sees the same stop signal and exits on its own

The pattern ensures that critical work continues even when individual workers fail.

Why This Implementation?

Channel-based Communication

  • Scalable Monitoring: A supervisor can watch one worker or several by selecting on their done channels (see the multi-worker sketch after this list)
  • Immediate Failure Detection: Workers can signal failure immediately
  • Clean Coordination: Channels provide natural synchronization
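
The example supervises a single worker, but the same loop extends to several. Below is a sketch of one possible extension that reuses workerWithFailure from the example; the function name is illustrative and it assumes the sync package in addition to the example’s imports:

// superviseMany runs one supervision loop per worker, so a failure in one
// worker is detected and restarted independently of the others.
func superviseMany(workerCount int, stop <-chan struct{}) {
    var wg sync.WaitGroup
    for id := 0; id < workerCount; id++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for {
                workerDone := make(chan struct{}, 1)
                go workerWithFailure(workerDone, stop)
                select {
                case <-workerDone:
                    fmt.Printf("Supervisor: worker %d exited, restarting...\n", id)
                    time.Sleep(500 * time.Millisecond)
                case <-stop:
                    fmt.Printf("Supervisor: stopping supervision of worker %d\n", id)
                    return
                }
            }
        }(id)
    }
    wg.Wait() // returns once every supervision loop has shut down
}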

Atomic Counter for Restarts

  • Thread Safety: Atomic operations ensure accurate restart counting
  • Performance: Atomic operations are more efficient than mutex-based counting
  • Simplicity: No need for complex synchronization around the counter

Separate Stop Channel

  • Graceful Shutdown: Supervisor can stop monitoring without waiting for failures
  • Clean Termination: Workers receive stop signals and can clean up properly
  • Resource Management: Prevents infinite restart loops

Worker Done Channel

  • Failure Signaling: Workers signal completion or failure to supervisor
  • Immediate Response: Supervisor can react immediately to worker state changes
  • Flexible Communication: The same channel could carry an error or result to distinguish failure from normal completion (the demo sends the same empty signal in both cases)

Restart Delay

  • Prevent Thundering Herd: Brief delay prevents rapid restart loops
  • Resource Recovery: Gives system time to recover from transient issues
  • Stability: Prevents overwhelming the system with restart attempts

Key Design Decisions

  1. Simulated Failures: Random failures (60% probability) demonstrate the restart mechanism
  2. Variable Work Time: Random work duration simulates real-world variability
  3. Prompt Restart: The supervisor relaunches a worker after only a short (500 ms) delay once it detects an exit
  4. Stop Signal Handling: Both supervisor and workers respond to stop signals
  5. Restart Counting: Tracks the number of restarts for monitoring purposes

Failure Handling Strategies

Immediate Restart

  • Pros: Minimal downtime, immediate recovery
  • Cons: May restart into the same failure condition
  • Use Case: Transient failures, network issues

Exponential Backoff

  • Pros: Prevents overwhelming the system, allows for recovery
  • Cons: Longer downtime for legitimate failures
  • Use Case: Persistent failures, resource exhaustion
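
Below is a sketch of how the example’s supervisor loop could apply exponential backoff, combined with the restart cap recommended under Best Practices further down. The delays, cap, and maxRestarts value are illustrative, and workerWithFailure is the worker from the example:

// superviseWithBackoff doubles the restart delay after every worker exit,
// caps it at five seconds, and gives up after maxRestarts attempts.
func superviseWithBackoff(stop <-chan struct{}) {
    const maxRestarts = 10
    delay := 100 * time.Millisecond
    for restarts := 0; restarts < maxRestarts; restarts++ {
        workerDone := make(chan struct{}, 1)
        go workerWithFailure(workerDone, stop)
        select {
        case <-workerDone:
            fmt.Printf("Supervisor: worker exited, waiting %v before restart\n", delay)
            time.Sleep(delay)
            delay *= 2
            if delay > 5*time.Second {
                delay = 5 * time.Second // cap the backoff
            }
        case <-stop:
            return
        }
    }
    fmt.Println("Supervisor: restart limit reached, giving up")
}

In a real system the delay would normally reset after a period of healthy operation, which requires the worker to report success and failure separately rather than through a single done signal.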

Circuit Breaker

  • Pros: Prevents cascading failures, allows system recovery
  • Cons: More complex implementation
  • Use Case: External service failures, resource constraints
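
Below is a rough sketch of a circuit-breaker-style supervisor: after a run of consecutive worker exits it stops restarting for a cooldown period, then allows another attempt. The thresholds are illustrative, and because the example’s done channel does not distinguish success from failure, a production breaker would want a richer signal (an error channel, for instance):

// superviseWithBreaker stops restarting after maxConsecutive worker exits and
// waits out a cooldown before allowing another attempt.
func superviseWithBreaker(stop <-chan struct{}) {
    const maxConsecutive = 3
    const cooldown = 10 * time.Second
    failures := 0
    for {
        if failures >= maxConsecutive {
            fmt.Println("Supervisor: circuit open, pausing restarts")
            select {
            case <-time.After(cooldown):
                failures = 0 // half-open: allow another attempt
            case <-stop:
                return
            }
        }
        workerDone := make(chan struct{}, 1)
        go workerWithFailure(workerDone, stop)
        select {
        case <-workerDone:
            failures++
        case <-stop:
            return
        }
    }
}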

Health Checks

  • Pros: Proactive failure detection, better monitoring
  • Cons: Additional complexity, false positives
  • Use Case: Long-running processes, critical systems
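
Below is a sketch of heartbeat-based health checking: the worker sends periodic heartbeats and the supervisor treats a missed heartbeat as a failure and restarts it. The intervals and function names are illustrative; in a real worker the heartbeats would stop when it hangs or crashes:

// heartbeatWorker sends a heartbeat every 200ms until told to stop.
func heartbeatWorker(heartbeat chan<- struct{}, stop <-chan struct{}) {
    ticker := time.NewTicker(200 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            select {
            case heartbeat <- struct{}{}: // "I'm alive"
            case <-stop:
                return
            }
        case <-stop:
            return
        }
    }
}

// superviseWithHeartbeat restarts the worker whenever a heartbeat is overdue.
func superviseWithHeartbeat(stop <-chan struct{}) {
    for {
        heartbeat := make(chan struct{})
        workerStop := make(chan struct{})
        go heartbeatWorker(heartbeat, workerStop)
        alive := true
        for alive {
            select {
            case <-heartbeat:
                // Healthy: keep waiting for the next beat.
            case <-time.After(1 * time.Second):
                fmt.Println("Supervisor: missed heartbeat, restarting worker")
                close(workerStop)
                alive = false
            case <-stop:
                close(workerStop)
                return
            }
        }
    }
}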

Common Use Cases

System Services

  • Background Workers: Ensure critical background tasks continue running
  • Data Processing: Restart failed data processing jobs
  • Scheduled Tasks: Ensure scheduled operations complete successfully

Microservices

  • Service Health: Monitor and restart unhealthy service instances
  • API Endpoints: Ensure critical API endpoints remain available
  • Message Processors: Restart failed message processing workers

Monitoring Systems

  • Health Monitors: Ensure monitoring agents continue running
  • Alert Processors: Restart failed alert processing workers
  • Metric Collectors: Ensure metric collection continues

Database Operations

  • Connection Managers: Restart failed database connection pools
  • Migration Workers: Ensure database migrations complete
  • Backup Jobs: Restart failed backup operations

File Processing

  • Upload Handlers: Restart failed file upload processors
  • Download Workers: Ensure file downloads complete
  • Processing Jobs: Restart failed file processing tasks

Network Services

  • Connection Handlers: Restart failed network connection handlers
  • Protocol Processors: Ensure protocol processing continues
  • Load Balancers: Restart failed load balancing workers

IoT Applications

  • Device Managers: Restart failed device communication workers
  • Sensor Processors: Ensure sensor data processing continues
  • Control Systems: Restart failed control system workers

Financial Systems

  • Trading Engines: Restart failed trading algorithm workers
  • Risk Calculators: Ensure risk calculations continue
  • Settlement Processors: Restart failed settlement workers

Best Practices

Failure Detection

  • Immediate Signaling: Workers should signal failure immediately
  • Graceful Degradation: Workers should handle partial failures gracefully
  • Health Reporting: Workers should report their health status

Restart Strategy

  • Appropriate Delays: Use delays to prevent restart storms
  • Backoff Policies: Implement exponential backoff for persistent failures
  • Maximum Restarts: Limit the number of restarts to prevent infinite loops

Resource Management

  • Cleanup: Ensure workers clean up resources before exiting
  • Memory Management: Monitor memory usage to prevent leaks
  • Connection Management: Properly close connections and file handles

Monitoring and Logging

  • Restart Tracking: Log restart events for monitoring
  • Performance Metrics: Track worker performance and restart frequency
  • Alerting: Alert on excessive restart rates

The supervisor/restart pattern is particularly effective when you have:

  • Critical Processes: Tasks that must continue running
  • Unreliable Workers: Workers that may fail due to external factors
  • Long-running Systems: Systems that need to operate continuously
  • Fault Tolerance: Requirements for high availability and reliability
  • Automatic Recovery: Need for self-healing systems

This pattern provides a robust foundation for building resilient systems that can automatically recover from failures and maintain continuous operation.

The final example implementation looks like this:

package examples

import (
	"fmt"
	"math/rand"
	"sync/atomic"
	"time"
)

// RunSupervisor demonstrates the supervisor/restart pattern.
func RunSupervisor() {
	fmt.Println("=== Supervisor/Restart Pattern Example ===")

	var restarts int32
	stop := make(chan struct{})
	done := make(chan struct{})

	// Supervisor goroutine
	go func() {
		for {
			workerDone := make(chan struct{}, 1) // buffered so the worker's final send never blocks after shutdown
			go workerWithFailure(workerDone, stop)
			select {
			case <-workerDone:
				atomic.AddInt32(&restarts, 1)
				fmt.Println("Supervisor: Worker failed, restarting...")
				// Restart after a short delay
				time.Sleep(500 * time.Millisecond)
			case <-stop:
				fmt.Println("Supervisor: Stopping worker supervision.")
				close(done)
				return
			}
		}
	}()

	// Let the supervisor run for a while
	time.Sleep(4 * time.Second)
	close(stop)
	<-done

	fmt.Printf("Supervisor example completed! Worker was restarted %d times.\n", restarts-1)
}

// workerWithFailure simulates a worker that randomly fails
func workerWithFailure(done chan<- struct{}, stop <-chan struct{}) {
	fmt.Println("Worker: Started")
	workTime := time.Duration(rand.Intn(1200)+400) * time.Millisecond
	select {
	case <-time.After(workTime):
		// Simulate random failure
		if rand.Float32() < 0.6 {
			fmt.Println("Worker: Simulated failure!")
			done <- struct{}{}
			return
		}
		fmt.Println("Worker: Completed work successfully.")
	case <-stop:
		fmt.Println("Worker: Received stop signal.")
	}
	// Signal exit (success or graceful stop)
	done <- struct{}{}
}

To run this example and build the code yourself, check out this and other examples in the go-fluency-concurrency-model-patterns repo. That’s it for this topic; tomorrow I’ll post on the pub/sub pattern.
