Channel iteration and goroutine leak

Redowan Delowar · June 21, 2026 · 4 min read

I ran into the classic “range over a channel” leak while working on a custom cron scheduler. I’ve debugged it in production many times, but writing it myself in a small piece of code reminded me how easy this bug is to write even when you know about it.

Here:

on each tick, the scheduler dispatches the jobs that are due
each job reports its outcome on a channel
one collector ranges over that channel to record the run

// cron/scheduler.go
func tick(due []Job) []outcome {
    results := make(chan outcome)

    var wg sync.WaitGroup
    for _, j := range due {
        wg.Add(1)
        go func() {
            results <- outcome{job: j.Name, err: j.Run()} // (1)
        }()
    }

    var log []outcome
    go func() {
        for r := range results { // (2)
            log = append(log, r)
            wg.Done()
        }
    }()

    wg.Wait()
    // (3) no close(results)
    return log
}

(1) each due job sends its outcome on the unbuffered channel
(2) the collector ranges over results, recording each outcome and marking it done
(3) once every job has reported, wg.Wait unblocks and tick returns without ever closing results

The producers are fine. Every send is matched by the collector’s receive, so each job goroutine sends once and exits. The collector is the problem. After the last outcome it loops back to the range and waits for the next value, but nothing ever closes the channel. So it blocks on that receive for the life of the process. Every tick leaks another one.
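To watch the leak accumulate, here is a minimal, self-contained sketch of the buggy tick. The Job and outcome definitions, the leakedAfter helper, and the settle delay are assumptions made for illustration, not the scheduler's real code. It passes j as an argument so it also behaves on toolchains older than Go 1.22, and it counts goroutines with runtime.NumGoroutine, which is only a rough signal.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// Job and outcome are hypothetical stand-ins for the scheduler's types.
type Job struct {
	Name string
	Run  func() error
}

type outcome struct {
	job string
	err error
}

// tick reproduces the bug: results is never closed.
func tick(due []Job) []outcome {
	results := make(chan outcome)

	var wg sync.WaitGroup
	for _, j := range due {
		wg.Add(1)
		go func(j Job) {
			results <- outcome{job: j.Name, err: j.Run()}
		}(j)
	}

	var log []outcome
	go func() {
		for r := range results { // blocks forever after the last outcome
			log = append(log, r)
			wg.Done()
		}
	}()

	wg.Wait()
	return log
}

// leakedAfter runs tick n times and reports how many goroutines are left over.
func leakedAfter(n int) int {
	due := []Job{
		{Name: "a", Run: func() error { return nil }},
		{Name: "b", Run: func() error { return nil }},
	}
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		tick(due)
	}
	time.Sleep(50 * time.Millisecond) // let the job goroutines finish exiting
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines left behind after 5 ticks:", leakedAfter(5))
}
```

Each call should leave exactly one goroutine behind, the collector, so the count grows by one per tick.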

Drain a channel by hand with a fixed number of receives and nothing leaks. Send three values and take exactly three:

ch := make(chan int)
go func() {
    <-ch
    <-ch
    <-ch
}()
ch <- 1
ch <- 2
ch <- 3

Three receives, then the goroutine returns. Swap those receives for a range and it leaks:

ch := make(chan int)
go func() {
    for range ch { // never ends: ch is never closed
    }
}()
ch <- 1
ch <- 2
ch <- 3

The two forms stop on different conditions. Three explicit receives stop on their own after the third value. A range keeps reading until the channel closes. Back in the scheduler, nothing closes results, so the ranging collector blocks on a receive that never completes.
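One way to see the range's stop condition is to spell it out as a two-value receive. The sketch below is roughly what a range over a channel does; drainLikeRange is a hypothetical helper name used only for illustration.

```go
package main

import "fmt"

// drainLikeRange receives from ch the way a range loop does: it keeps
// reading until a receive reports that the channel is closed and empty.
func drainLikeRange(ch <-chan int) []int {
	var got []int
	for {
		v, ok := <-ch
		if !ok { // closed and drained: the only way out of the loop
			return got
		}
		got = append(got, v)
	}
}

func main() {
	ch := make(chan int)
	done := make(chan []int)
	go func() { done <- drainLikeRange(ch) }()

	ch <- 1
	ch <- 2
	ch <- 3
	close(ch) // without this, drainLikeRange never returns

	fmt.Println(<-done) // prints [1 2 3]
}
```

The loop has no counter and no other exit, so the number of values sent does not matter. Only ok turning false ends it.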

The fix is the one line the buggy version is missing: close results once every job has reported. The range ends and the collector returns:

// cron/scheduler.go
// ...

    wg.Wait()
    close(results) // ends the range, the collector returns
    return log

Warning

Reaching for a buffered channel instead won’t fix this. A range ends only when the channel is closed. No matter how big the buffer is, the receiver keeps waiting for a close that never comes.
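A small sketch of that warning, under stated assumptions: rangeEnds and the settle timeout are hypothetical names for this demo, and the timeout is only a stand-in for "never", since a blocked range cannot be observed directly.

```go
package main

import (
	"fmt"
	"time"
)

// settle is how long the demo waits before deciding the range is stuck.
const settle = 100 * time.Millisecond

// rangeEnds reports how many values a range over ch received and whether
// the loop finished within settle. If it did not finish, the ranging
// goroutine stays blocked, which is exactly the leak being shown.
func rangeEnds(ch <-chan int) (int, bool) {
	done := make(chan int, 1)
	go func() {
		n := 0
		for range ch {
			n++
		}
		done <- n
	}()

	select {
	case n := <-done:
		return n, true
	case <-time.After(settle):
		return 0, false
	}
}

func main() {
	open := make(chan int, 8) // plenty of buffer, never closed
	open <- 1
	open <- 2
	fmt.Println(rangeEnds(open)) // the range drains the buffer, then keeps waiting

	closed := make(chan int, 8) // same buffer, closed after the sends
	closed <- 1
	closed <- 2
	close(closed)
	fmt.Println(rangeEnds(closed)) // the range drains the buffer, then ends
}
```

The buffer only changes whether the senders block. It has no effect on when the receiver's loop ends.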

This is a fairly well-documented leak. Uber called it channel iteration misuse.

Typically you’d catch a leak like this with goleak:

wire up goleak
exercise the path that leaks in a test
the test fails with the stuck goroutine’s stack

I wrote about the goleak workflow in the early return leak post. But goleak only catches a leak when a test exercises the buggy path, and my scheduler tests never ran that path. So goleak never saw it.

What caught it was Go 1.27’s new leak profile. I was running it over my own code while writing about it, and it doesn’t need a test at all. It leans on the garbage collector to find goroutines blocked on something nothing can ever reach, and reports only those. Run it at debug=2 and the stuck collector shows up tagged (leaked):

goroutine 25 [chan receive (leaked)]:
main.tick.func2()
    leaky-tick/main.go:43 +0x60
created by main.tick in goroutine 1
    leaky-tick/main.go:42 +0x15c

main.tick.func2 is the collector, parked on the range at line 43. The profile finds leaks like this deterministically, with no false positives and without a test ever exercising the path.
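For reference, a hedged sketch of pulling that profile from inside a program. The profile name "goroutineleak" is an assumption taken from the Go 1.26 experiment (GOEXPERIMENT=goroutineleakprofile), and leakProfile and errNoLeakProfile are hypothetical names. pprof.Lookup returns nil for a profile the toolchain does not register, so the sketch degrades to a clear error instead of crashing.

```go
package main

import (
	"errors"
	"fmt"
	"runtime/pprof"
	"strings"
	"time"
)

// errNoLeakProfile is returned when the running toolchain does not
// register a goroutine leak profile.
var errNoLeakProfile = errors.New("goroutine leak profile not available")

// leakProfile returns the leak profile rendered at debug=2.
// The profile name is an assumption; adjust it if your toolchain differs.
func leakProfile() (string, error) {
	p := pprof.Lookup("goroutineleak")
	if p == nil {
		return "", errNoLeakProfile
	}
	var sb strings.Builder
	if err := p.WriteTo(&sb, 2); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	ch := make(chan int)
	go func() {
		for range ch { // leaks: ch is never closed
		}
	}()
	time.Sleep(10 * time.Millisecond) // give the goroutine time to park

	text, err := leakProfile()
	if err != nil {
		fmt.Println("skipping:", err)
		return
	}
	fmt.Print(text)
}
```

On a toolchain without the profile this prints the skip message. With it, the stuck goroutine should appear in the dump much like the trace above.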

Closing the channel stops the leak, but it leaves one odd bit: the WaitGroup counts jobs, yet it is the collector, not the jobs, that calls Done.

The collector should only drain results. Job completion belongs to the goroutine that runs the job. Once every job returns, a waiter can close results, and the range can finish normally.

With wg.Go, the corrected version becomes:

// cron/scheduler.go
func tick(due []Job) []outcome {
    results := make(chan outcome)

    var wg sync.WaitGroup
    for _, j := range due {
        wg.Go(func() { // (1)
            results <- outcome{job: j.Name, err: j.Run()}
        })
    }

    go func() {
        wg.Wait()
        close(results) // (2)
    }()

    var log []outcome
    for r := range results { // (3)
        log = append(log, r)
    }
    return log
}

(1) wg.Go runs each job and calls Done when it returns, so each job marks its own completion
(2) a separate goroutine waits for every job, then closes results so the range can end
(3) tick drains results itself, so there is no separate collector goroutine
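wg.Go arrived in Go 1.25. For older toolchains, the same shape can be sketched with Add and a deferred Done, which is what wg.Go does on your behalf. The Job and outcome definitions and the sample jobs below are assumptions added to make the sketch runnable on its own.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Job and outcome are hypothetical stand-ins for the scheduler's types.
type Job struct {
	Name string
	Run  func() error
}

type outcome struct {
	job string
	err error
}

func tick(due []Job) []outcome {
	results := make(chan outcome)

	var wg sync.WaitGroup
	for _, j := range due {
		wg.Add(1)
		go func(j Job) {
			defer wg.Done() // each job marks its own completion
			results <- outcome{job: j.Name, err: j.Run()}
		}(j)
	}

	// Close results once every job has sent, so the range below can end.
	go func() {
		wg.Wait()
		close(results)
	}()

	var log []outcome
	for r := range results {
		log = append(log, r)
	}
	return log
}

func main() {
	due := []Job{
		{Name: "backup", Run: func() error { return nil }},
		{Name: "report", Run: func() error { return errors.New("disk full") }},
	}
	// Outcomes arrive in completion order, which can vary between runs.
	for _, o := range tick(due) {
		fmt.Printf("%s: %v\n", o.job, o.err)
	}
}
```

With an empty due list, wg.Wait returns immediately, results is closed, and tick returns an empty log without blocking.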

Forget the close here and tick blocks on the range after the last result. All producers have exited, so no one can send another value. It’s the same missing close, but now the caller itself hangs instead of leaking a background collector, and if no other goroutine can make progress the runtime aborts with "all goroutines are asleep - deadlock!".

The code is available in the example repo.