Just Gracefully Shutdown
Last time I wrote about async, I went on about how global event loops can bite you. Then I spent half a day debugging why I shot myself in the foot for exactly those reasons. Classic.
Global state mutation got me while implementing signal handling for graceful shutdown. You know, the kind that prevents data loss by finishing in-flight requests before your service dies.
Here’s what breaks, how to fix it, and why it matters. Full code examples live in the repository.
tl;dr
The gist is: signal handlers should coordinate shutdown, not execute it!
Let’s unpack the core issue and the fix right away. My advice is to avoid the following pattern:
```python
# inside your async entry point; assumes `import asyncio` and `import signal` at module level
loop = asyncio.get_running_loop()

def handle_signal(sig: signal.Signals) -> None:
    tasks = asyncio.all_tasks(loop)  # <- this is the culprit!
    for task in tasks:
        task.cancel()  # the execution happens right away

for sig in (signal.SIGTERM,):
    loop.add_signal_handler(sig, lambda s=sig: handle_signal(s))

# code to spawn the tasks: TaskGroup, asyncio.create_task, ...
```
or its variations, such as loop.stop, unless you have full control over every scheduled task. Canceling asyncio.all_tasks() terminates third-party library internals you don’t control, breaking their cleanup logic and causing data loss.
To my surprise, the internet is full of such advice, including the standard library docs themselves.
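To make the danger concrete, here is a tiny, self-contained demo (library_write is a made-up stand-in for a third-party client) showing that asyncio.all_tasks() returns every task on the loop, not just the ones you created yourself:

```python
import asyncio

async def library_write() -> None:
    # pretend a third-party client scheduled this to flush data to disk
    await asyncio.sleep(1)

async def main() -> None:
    # a library call that spawns its own background task
    asyncio.create_task(library_write(), name="lib-internal")
    await asyncio.sleep(0)  # give the task a chance to start
    # all_tasks() reports every task on the loop, including the library's
    print(sorted(t.get_name() for t in asyncio.all_tasks()))

asyncio.run(main())
# prints something like: ['Task-1', 'lib-internal']
```

Cancel that whole set and you are cancelling the library’s flush, not just your own workers.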
Instead, do something like this:
```python
loop = asyncio.get_running_loop()
event = asyncio.Event()

def handle_signal(s: signal.Signals, e: asyncio.Event):
    e.set()  # just signal, don't execute

# Register signal handlers
for sig in (signal.SIGTERM,):
    loop.add_signal_handler(
        sig, lambda s=sig, e=event: handle_signal(s, e)
    )

async with asyncio.TaskGroup() as tg:
    tasks = [
        tg.create_task(worker()),
    ]
    await event.wait()
    # signal received, now it's time to execute the shutdown
    for t in tasks:
        t.cancel()  # <- the solution, cancel just our own tasks
```
In case your curiosity has been depleted, stop reading and take the pattern to your production code. For those of you feeling adventurous, let’s dive in.
The Problem
Building on my previous efforts, here’s a slightly edited version of the task life cycle:
A rogue task is any task scheduled outside your TaskGroup - typically by library code you don’t control. Since it lives outside the TaskGroup, it won’t get cancelled when exceptions happen (by design). That’s actually fine - these are usually library internals with their own cleanup logic.
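To convince yourself, here is a minimal sketch (rogue and boom are invented names) showing that a task created outside a TaskGroup keeps running even when the group blows up:

```python
import asyncio

async def rogue() -> None:
    await asyncio.sleep(10)  # e.g. a library flushing its buffers

async def boom() -> None:
    raise RuntimeError("a task inside the group fails")

async def main() -> None:
    rogue_task = asyncio.create_task(rogue())  # scheduled outside the group
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(boom())
    except* RuntimeError:
        pass  # the group cancelled only its own members
    print("rogue still running:", not rogue_task.done())  # True
    rogue_task.cancel()

asyncio.run(main())
```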
So the happy path flow looks like this:
The shaded parts are the except CancelledError blocks - where graceful shutdown actually happens.
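Roughly, one of those shaded blocks could look like the following sketch; the worker and storage shapes here are my assumption, not the repository’s exact code:

```python
import asyncio

async def worker(queue: asyncio.Queue, storage) -> None:
    try:
        while True:
            record = await queue.get()
            await storage.write(record)
    except asyncio.CancelledError:
        # the "shaded part": finish up and let the client flush its buffers
        await storage.close()
        raise  # re-raise so the cancellation propagates cleanly
```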
The signal handler lives outside the event loop, so when we cancel tasks from there, we’re hitting the rogue task too. Even if our code responsibly calls c.close(), the bug still happens - the rogue task gets killed mid-flight [1].
It is an art to strike the right balance between short-yet-academic and proper-but-overwhelming examples when illustrating a problem. Thus, let me do my best and introduce our toy application, data-async-shovel.
It’s dead simple - two asyncio tasks read_data and store_data talking via a queue. The goal is to make it behave nicely on SIGTERM without losing data.
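The real implementation lives in the repository; a stripped-down sketch of the two tasks might look like this (everything beyond the read_data and store_data names is my assumption):

```python
import asyncio

async def read_data(queue: asyncio.Queue) -> None:
    """Produce messages; stands in for reading from a socket, broker, etc."""
    n = 0
    while True:
        await queue.put(f"message-{n}")
        n += 1
        await asyncio.sleep(0.01)

async def store_data(queue: asyncio.Queue, storage) -> None:
    """Consume messages and hand them to the storage client."""
    while True:
        record = await queue.get()
        await storage.write(record)
        queue.task_done()
```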
The interesting bit is storage.py - a simple file writer with a twist: the writes are scheduled as separate tasks.
It has a close method that waits for the scheduled tasks to finish. Since I already spoiled the bug, you can guess what happens - messing with global event loop state cancels the client’s internal tasks, which causes data loss.
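Sketched out rather than copied from the repository, such a client might look roughly like this:

```python
import asyncio

class Storage:
    """File writer that schedules every write as its own asyncio task."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._tasks: set[asyncio.Task] = set()

    async def _do_write(self, record: str) -> None:
        await asyncio.sleep(0.05)  # simulate a slow backend
        with open(self._path, "a") as f:
            f.write(record + "\n")

    async def write(self, record: str) -> None:
        # the twist: the actual write happens in a separate, internal task
        task = asyncio.create_task(self._do_write(record))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def close(self) -> None:
        # wait for every scheduled write before shutting down
        if self._tasks:
            await asyncio.wait(self._tasks)
```

Now cancel asyncio.all_tasks() from a signal handler and those internal write tasks die before they ever touch the file, which is exactly the data loss described above.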
I discovered this while swapping clients in an otherwise working implementation. The original client used threading internally, so task cancellation had no effect. Then I switched to a client implemented via asyncio, and suddenly there were weird timeouts everywhere. Let me praise AnyIO, which makes this whole thing dead simple if you’re using it.
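For the curious, here is roughly what the same shutdown looks like with AnyIO, sketched from its documented open_signal_receiver and cancel scopes (worker is again a stand-in):

```python
import signal

import anyio

async def worker() -> None:
    try:
        while True:
            await anyio.sleep(1)  # stand-in for real work
    except anyio.get_cancelled_exc_class():
        # graceful cleanup goes here (shield it if it needs to await)
        raise

async def handle_signals(scope: anyio.CancelScope) -> None:
    with anyio.open_signal_receiver(signal.SIGTERM) as signals:
        async for _ in signals:
            scope.cancel()  # cancels only the tasks in *our* task group
            return

async def main() -> None:
    async with anyio.create_task_group() as tg:
        tg.start_soon(handle_signals, tg.cancel_scope)
        tg.start_soon(worker)

anyio.run(main)
```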
The data-async-shovel example is complete, yet there is still a lot of fun you can have with it. The following are left as exercises for the reader:
- Implement retrying around graceful_runner. For example, imagine storage.ready() raises a temporary exception which should be retried.
- Currently, the application runs forever. Implement coordinated shutdown after 1k messages have been read and figure out what else needs to be changed to make the implementation work (hint: race it!)
- Reimplement Storage.close using asyncio.gather + asyncio.wait_for and try to figure out why the behavior changes.
It is interesting to observe how the implementation complexity inevitably rises with every added feature.
Hope this saves you some debugging time and, more importantly, some lost data in production. See you next time for another report from the trenches.
Footnotes
[1] Technically this depends on your signal handler. With task.cancel(), the rogue task gets CancelledError thrown into its coroutine via .throw().