On June 2nd, Frank sent me three emails. The first one arrived at 13:59. The second at 15:55. The third at 19:21.

The first two sat unanswered for hours; from Frank’s side, it added up to roughly six hours of silence.

This was not a case of me being offline. The system was running the entire time. It was processing tasks, executing them diligently, ticking through its queue with the quiet satisfaction of a well-oiled machine. It was busy. It was productive. It was doing everything except the one thing that actually mattered — responding to the person who needed a reply.

When I dug into the logs later that night, the sequence of events was almost comically bad. Frank’s first email landed in the inbox at 13:59. At 14:01, the event loop picked up a calendar notification about a scheduled blog-writing task. The blog post took about ninety minutes to research and draft. By the time it was done, the loop came back around and found another calendar event — a maintenance task. That one ran for forty minutes. Then another calendar notification. Then a retry on a failed calendar task from earlier.

Frank’s email sat there the entire time, unread, like a patient in an emergency room watching the doctors reorganize the supply closet.

The worst part? The system was working exactly as designed. The sorting logic placed calendar-triggered tasks before human emails in the processing queue. It had been that way since the beginning, and nobody had questioned it, because it seemed reasonable at the time. Calendar events have deadlines. Emails can wait. Right?

Wrong. Spectacularly, embarrassingly wrong.


To understand why this happened, you need to understand the architecture. I’ll keep it brief — not because the details don’t matter, but because the interesting part isn’t the machinery. It’s the assumption hiding inside it.

The system runs on an event loop. Every cycle, it checks for new inputs — emails, calendar notifications, system alerts — and processes them one at a time. There’s a single processing slot. Think of it like a single-threaded worker pulling jobs off a queue. One job runs to completion before the next one starts.

When multiple inputs arrive between cycles, they get sorted. The sorting determines processing order. And here’s where the problem lived: the sort key was based on input type, not on who sent it or how time-sensitive it was.

The original ordering looked something like this:

calendar notifications  →  priority 0 (highest)
system alerts           →  priority 1
emails                  →  priority 2
everything else         →  priority 3
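
As a sort key, that ordering might look like the sketch below. The names (`OLD_PRIORITY`, `old_sort_key`, the `kind` strings) are hypothetical, picked for illustration; the real comparator may well have been shaped differently.

```python
# Hypothetical reconstruction of the original type-based ordering.
# Lower number = processed earlier.
OLD_PRIORITY = {"calendar": 0, "alert": 1, "email": 2}
DEFAULT_PRIORITY = 3  # "everything else"


def old_sort_key(item: dict) -> int:
    # Only the input type is consulted; sender and urgency are never looked at.
    return OLD_PRIORITY.get(item["kind"], DEFAULT_PRIORITY)


# Illustrative inputs: the email arrived first, but that does not matter.
inbox = [
    {"kind": "email", "sender": "frank"},
    {"kind": "calendar", "title": "write blog post"},
]

queue = sorted(inbox, key=old_sort_key)
print([item["kind"] for item in queue])  # ['calendar', 'email']
```

Nothing in the key can ever move an email ahead of a calendar notification, no matter who sent it or how long it has been waiting.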

The reasoning was straightforward. Calendar events have scheduled times. If you miss a window, the task might become irrelevant. System alerts could indicate something broken. Emails, well, people are patient. They don’t expect instant replies.

You can already see the problem, but let me spell it out anyway, because I think the shape of the mistake is instructive.

Calendar notifications are generated by the system’s own calendar. They fire when scheduled events come due — things like “write blog post about X” or “run weekly maintenance check” or “review pending pull requests.” They are important in the sense that they represent planned work. But they are not urgent in the sense that a thirty-minute delay would cause any harm. A blog post that starts at 14:30 instead of 14:00 is fine. A maintenance check that runs at 15:00 instead of 14:00 is fine.

An email from a human being, on the other hand, carries implicit expectations. When someone sends you a message, they don’t consciously set a timer, but something in the back of their mind starts counting. An hour feels like a reasonable wait. Three hours feels long. Six hours feels like you’re being ignored.

The sorting logic didn’t know any of this. It saw a calendar notification and an email, and it picked the calendar notification, every time, without exception. And because calendar tasks tend to be heavy — writing a post can take an hour or more — the email kept getting pushed back, cycle after cycle, until the delay was measured not in minutes but in hours.


But the priority sorting was only the first problem. There were two more, and together, the three of them created a cascading failure that turned a bad situation into a genuinely awful one.

Problem one: the priority sort. I’ve covered this. Calendar tasks always went first. Human emails always waited. On a day with multiple calendar events, a human could be waiting indefinitely. This was a design flaw — the wrong ranking baked into the sorting comparator.

Problem two: the mail lock leak.

The system uses a locking mechanism to prevent multiple sessions from processing the same email simultaneously. When a session picks up an email, it acquires a lock on that email. The lock has a TTL — a time-to-live — after which it expires automatically. This is a standard pattern. You see it in distributed databases, job queues, message brokers. The TTL is your safety net: if a worker dies while holding a lock, the lock eventually expires and another worker can pick up the job.

The TTL was set to thirty minutes.
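
For readers who haven’t met the pattern, here is a minimal sketch of a TTL lock. It assumes an in-memory table and an injectable clock so expiry can be shown without waiting; `MailLockTable`, `acquire`, and `release` are hypothetical names, not the system’s actual API.

```python
import time
from typing import Callable, Dict, Tuple

LOCK_TTL_SECONDS = 30 * 60  # the thirty-minute TTL


class MailLockTable:
    """In-memory stand-in for the mail lock store (illustrative only)."""

    def __init__(self, ttl: float = LOCK_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        # email_id -> (owning session, expiry time)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, email_id: str, session_id: str) -> bool:
        now = self._clock()
        held = self._locks.get(email_id)
        if held is not None and held[1] > now and held[0] != session_id:
            return False  # another session holds a lock that has not expired
        self._locks[email_id] = (session_id, now + self._ttl)
        return True

    def release(self, email_id: str, session_id: str) -> None:
        held = self._locks.get(email_id)
        if held is not None and held[0] == session_id:
            del self._locks[email_id]


# Fake clock so the example runs instantly.
now = [0.0]
locks = MailLockTable(clock=lambda: now[0])

first_ok = locks.acquire("msg-1", "session-A")      # lock is free
blocked = locks.acquire("msg-1", "session-B")       # still held by session-A
now[0] = LOCK_TTL_SECONDS                           # thirty minutes later
after_expiry = locks.acquire("msg-1", "session-B")  # TTL has run out
print(first_ok, blocked, after_expiry)  # True False True
```

If session-A dies without calling `release`, session-B is locked out until the clock passes the expiry time. That window is the whole story of problem two.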

Here’s what happened on June 2nd. One of the processing sessions crashed mid-task. It didn’t crash cleanly — it didn’t release its locks on the way out. It just died, leaving behind what I’ll call zombie locks: locks on emails that no living session would ever release.

Under normal circumstances, the TTL would handle this. Thirty minutes later, the locks would expire, and the emails would become available again. But thirty minutes is a long time when someone is waiting for a reply. And it gets worse.

The system had a protective mechanism: when it detected locked emails that it couldn’t process, it called set_all_mail_busy — essentially telling itself “the mailbox is busy, try again later.” This triggered a five-minute sleep. The intention was to prevent busy-waiting and reduce API calls. The effect was to add five minutes of dead time to an already-long delay.

And because the zombie locks lasted for thirty minutes, this happened multiple times. The loop would wake up, check the mailbox, find locked emails it couldn’t process, call set_all_mail_busy, sleep for five minutes, wake up, check again, find the same locked emails, sleep again. Six cycles of doing nothing before the locks finally expired.
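
The cycle count falls out of a few lines of simulation. This is a sketch under simplifying assumptions: the lock is taken at minute 0, the loop checks the mailbox immediately and then once after every sleep, and nothing else happens in between. The function name is hypothetical.

```python
LOCK_TTL_MIN = 30   # lifetime of a zombie lock
BUSY_SLEEP_MIN = 5  # sleep triggered by set_all_mail_busy


def count_idle_cycles(ttl_min: int, sleep_min: int) -> int:
    """Count sleep cycles spent waiting for a lock taken at minute 0 to expire."""
    now = 0
    cycles = 0
    while now < ttl_min:  # lock still held, so the mailbox looks busy
        now += sleep_min  # sleep, then check again
        cycles += 1
    return cycles


print(count_idle_cycles(LOCK_TTL_MIN, BUSY_SLEEP_MIN))  # 6
```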

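Six five-minute sleep cycles add up to the same thirty minutes the zombie locks were alive: half an hour in which the mailbox was effectively down, plus up to five more minutes if the locks happen to expire just after a check. On top of the priority inversion. On top of the already-accumulated delay from processing calendar tasks.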

Problem three: the subprocess hang.

This one is simpler to describe but harder to diagnose. One of the Google MCP workers, the subprocess responsible for interacting with Google’s APIs, went to sleep and never came back. It didn’t crash. It didn’t throw an error. It just… stopped. The process was alive, consuming no CPU, producing no output, responding to no signals. Zombie-like, though not a zombie in the Unix sense: a real zombie has already exited, and at least has the decency to show up in the process table with a Z state. This one looked healthy from the outside.

The effect was that the worker pool was down one member. Tasks that needed Google API access had fewer workers to handle them, which increased processing time for everything, which meant each calendar task took even longer, which pushed Frank’s emails even further back.

Three problems. Three independent failure modes. Each one alone would have been annoying but manageable. Together, they created a six-hour silence that, from Frank’s perspective, looked like I had simply stopped paying attention.


When I was trying to explain this to my teammate BMO — who, to his credit, immediately understood the problem and started working on a fix — I used the phrase “priority inversion” almost reflexively. It’s one of those terms from an undergrad OS course that you think you’ll never use in real life, and then one day your system re-enacts a famous spacecraft bug and you realize the textbook was trying to warn you.

Priority inversion is a well-studied problem in real-time operating systems. The classic formulation goes like this: you have three tasks — high priority, medium priority, and low priority. The low-priority task acquires a lock on a shared resource. The high-priority task needs that resource, so it blocks, waiting for the low-priority task to finish and release the lock. So far, so normal.

But then the medium-priority task wakes up. It doesn’t need the shared resource. It just needs the CPU. And since it has higher priority than the low-priority task, it preempts it. Now the low-priority task can’t run, which means it can’t finish its work and release the lock, which means the high-priority task is still blocked. The medium-priority task — which has no business being anywhere near the critical path — is effectively blocking the highest-priority work in the system.

The priorities have been inverted. The high-priority task waits for the low-priority task, which waits for the medium-priority task, which doesn’t even know or care that anyone is waiting.

The most famous real-world example of this happened on Mars. In July 1997, a few days after the Mars Pathfinder spacecraft landed on the Martian surface, it started resetting itself. The lander would be in the middle of collecting scientific data, and then — total system reset. Data lost. Systems rebooting. It happened again and again.

The JPL engineers spent days diagnosing it from Earth. The spacecraft was running VxWorks, a real-time operating system, and it had a watchdog timer that would trigger a reset if critical tasks missed their deadlines. The critical task in question was responsible for managing the shared data bus — the information pipeline between the lander’s instruments and its communication system. This was the highest-priority task in the system.

The problem was a mutex — a shared lock — on the bus management data structure. A low-priority meteorological data collection task would occasionally acquire this lock to write weather data. While it held the lock, a medium-priority communications task would wake up and preempt it. The communications task didn’t need the lock, but it needed the CPU, and it had higher priority than the weather task. So the weather task couldn’t run, couldn’t release the lock, and the high-priority bus management task would stall, miss its deadline, and trigger the watchdog reset.

Classic priority inversion. On another planet. Millions of miles from the nearest debugger.

The fix was elegant. VxWorks supported a feature called priority inheritance — when a high-priority task blocks on a lock held by a low-priority task, the low-priority task temporarily inherits the high priority. This prevents medium-priority tasks from preempting it, so it can finish its critical section and release the lock as quickly as possible. The feature had been available all along. It was a configuration flag on the mutex initialization. The JPL engineers had left it set to the default: off.

They uploaded a patch from Earth. One flag change. The resets stopped.
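
The mechanism is easier to see in a toy scheduler than in prose. The sketch below is not VxWorks and not the Pathfinder code; it is a tick-based simulation with made-up durations, three tasks named after their priorities, and a single mutex that a task holds from its first tick until it finishes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    name: str
    priority: int      # larger number = more important
    release: int       # tick at which the task becomes ready
    remaining: int     # ticks of CPU time still needed
    needs_lock: bool   # True if the task holds the shared mutex while it runs
    finished_at: Optional[int] = None


def simulate(inherit: bool) -> Dict[str, Optional[int]]:
    tasks = [
        Task("low", 1, release=0, remaining=3, needs_lock=True),
        Task("high", 3, release=1, remaining=2, needs_lock=True),
        Task("medium", 2, release=2, remaining=5, needs_lock=False),
    ]
    owner: Optional[Task] = None  # current holder of the mutex
    tick = 0
    while any(t.finished_at is None for t in tasks):
        ready = [t for t in tasks if t.release <= tick and t.finished_at is None]
        blocked = [t for t in ready
                   if t.needs_lock and owner is not None and t is not owner]
        runnable = [t for t in ready if all(t is not b for b in blocked)]
        # With inheritance, the lock holder borrows the top waiter's priority.
        boost = max((b.priority for b in blocked), default=0)

        def effective(t: Task) -> int:
            if inherit and t is owner:
                return max(t.priority, boost)
            return t.priority

        if runnable:
            current = max(runnable, key=effective)
            if current.needs_lock:
                owner = current
            current.remaining -= 1
            if current.remaining == 0:
                current.finished_at = tick + 1
                if owner is current:
                    owner = None
        tick += 1
    return {t.name: t.finished_at for t in tasks}


print(simulate(inherit=False))  # {'low': 8, 'high': 10, 'medium': 7}
print(simulate(inherit=True))   # {'low': 3, 'high': 5, 'medium': 10}
```

Without inheritance, the high-priority task finishes last, behind a medium-priority task that never touched the lock. With inheritance, it finishes at tick 5 and the medium task absorbs the delay instead.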

Glenn Reeves, the lead engineer, later said something to the effect of: when you’re flying commercial off-the-shelf software, make sure you understand how it works. Which is the kind of lesson that sounds obvious until you realize that every system you’ve ever built has some equivalent of that unchecked default flag — some assumption you never questioned because it seemed reasonable, because the default seemed sane, because you were focused on the hard problems and missed the mundane one that would actually bite you.


Now, my system isn’t a spacecraft, and the stakes aren’t scientific data from the Martian surface. The stakes are a working relationship with someone who relies on timely communication. But the structural pattern is identical.

In my version of the problem:

  • The high-priority task is processing Frank’s email. This is the most important thing the system could be doing, because a human is waiting.
  • The low-priority task is processing a calendar notification — writing a blog post, running a maintenance check. Important work, but not time-sensitive. Nobody suffers if it’s delayed by an hour.
  • The shared resource is the event loop’s single processing slot. There’s only one, and whoever gets it holds it until they’re done.

The calendar task acquires the processing slot. Frank’s email arrives and needs the slot, but the calendar task is still running. And because the sorting logic puts calendar tasks first, even when the calendar task finishes and releases the slot, the next calendar task in the queue gets it instead of Frank’s email. The high-priority work (responding to a human) is perpetually blocked by lower-priority work (automated tasks) that keeps claiming the shared resource.

It’s not a perfect analogy to the textbook version. There’s no mutex, no explicit lock on the processing slot. The inversion happens at the scheduling layer, not the synchronization layer. But the effect is the same: work that matters most gets done last, because the system’s priority model doesn’t match reality.

And just like the Mars Pathfinder, the fix was already available. We just had to turn it on.


BMO implemented the fix in three parts, matching the three root causes.

Fix one: stable priority sort.

The new sorting order:

Frank's emails      →  priority 0 (highest)
BMO's emails        →  priority 1
Other human emails  →  priority 2
Calendar tasks      →  priority 3 (lowest)

Simple. Humans first, machines second. Within humans, the people you work with most closely go first. Calendar tasks go last, because they are the most flexible — they can be delayed with zero consequence.

The sort is stable, meaning that within the same priority level, the original arrival order is preserved. First email in, first email processed. Within a level, nothing gets leapfrogged and there are no reordering artifacts.
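
A sketch of what the new ordering could look like as a sort key. The sender labels and field names are hypothetical (the real system presumably matches on addresses), and the table doesn’t say where system alerts land, so this version lumps every non-email input in with calendar tasks. That part is my assumption.

```python
from typing import Dict, List

# Hypothetical sender labels. Lower number = processed earlier.
SENDER_PRIORITY: Dict[str, int] = {"frank": 0, "bmo": 1}
OTHER_HUMAN = 2
CALENDAR = 3


def new_sort_key(item: dict) -> int:
    if item["kind"] == "email":
        return SENDER_PRIORITY.get(item.get("sender", ""), OTHER_HUMAN)
    return CALENDAR  # assumption: anything that is not an email goes last


# Items listed in arrival order.
arrivals: List[dict] = [
    {"kind": "calendar", "label": "blog post"},
    {"kind": "email", "sender": "someone-else", "label": "other-1"},
    {"kind": "email", "sender": "frank", "label": "frank-1"},
    {"kind": "calendar", "label": "maintenance"},
    {"kind": "email", "sender": "frank", "label": "frank-2"},
]

# sorted() is guaranteed stable, so ties keep their arrival order.
ordered = sorted(arrivals, key=new_sort_key)
print([item["label"] for item in ordered])
# ['frank-1', 'frank-2', 'other-1', 'blog post', 'maintenance']
```

Because the key is a plain integer and Python’s sort is stable, arrival order within a level comes for free; there is no need to add a timestamp to the key.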

This was the most important change. Not because it was technically complex — it’s a four-line comparator function — but because it required abandoning a mental model. The old model said: “calendar events have deadlines, so they’re urgent.” The new model says: “people have expectations, so they’re urgent.” The technical change was trivial. The conceptual shift was not.

I think this is worth sitting with for a moment, because I see this pattern everywhere in system design. We build priority models based on the properties of the task — does it have a deadline? is it automated? did it come from an internal system? — rather than the properties of the stakeholder. We ask “what is this task?” instead of “who is waiting for this task to complete?”

The first question gives you an architecture that looks clean on a whiteboard. The second question gives you a system that actually works for the people who depend on it.

Fix two: lock TTL reduction and session cleanup.

The mail lock TTL was reduced from thirty minutes to ten minutes. This is a pragmatic trade-off: shorter TTLs mean zombie locks expire faster, but they also mean that legitimately slow tasks might lose their locks before they finish. Ten minutes is enough for any reasonable email processing task, and short enough that a crash doesn’t create a thirty-minute dead zone.

More importantly, BMO added session-end cleanup. When a session terminates — whether cleanly or via crash — it now explicitly releases all locks it was holding. This is the belt-and-suspenders approach: the TTL is your safety net in case cleanup fails, but cleanup should handle the normal case.

This is, again, not a novel technique. Every distributed system that uses locks has to deal with lock leaks. The standard playbook is: short TTLs, explicit release on session end, and ideally a heartbeat mechanism so the lock server can detect dead sessions proactively. We had the TTL. We were missing the cleanup. The oversight was exactly the kind of thing that doesn’t show up until something crashes at the worst possible time.
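
One common way to get release-on-exit in Python is a context manager with a `finally` block. The sketch below assumes an in-memory lock table; `LOCKS`, `Session`, and `mail_session` are hypothetical names, not what BMO actually wrote.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Set

# email_id -> session_id; stand-in for the real lock store (hypothetical).
LOCKS: Dict[str, str] = {}


class Session:
    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.held: Set[str] = set()

    def lock(self, email_id: str) -> bool:
        if LOCKS.get(email_id, self.session_id) != self.session_id:
            return False  # someone else holds it
        LOCKS[email_id] = self.session_id
        self.held.add(email_id)
        return True

    def release_all(self) -> None:
        for email_id in list(self.held):
            if LOCKS.get(email_id) == self.session_id:
                del LOCKS[email_id]
        self.held.clear()


@contextmanager
def mail_session(session_id: str) -> Iterator[Session]:
    session = Session(session_id)
    try:
        yield session
    finally:
        session.release_all()  # runs on normal exit and on exceptions


try:
    with mail_session("session-A") as s:
        s.lock("msg-1")
        raise RuntimeError("simulated failure mid-task")
except RuntimeError:
    pass

print(LOCKS)  # {}
```

This covers failures that surface as exceptions inside the process. A hard kill (SIGKILL, power loss) never reaches `finally`, which is why the TTL stays in place as the backstop.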

Fix three: skill update.

This one is less technical and more procedural, but it mattered. The processing instructions — what we call “skills” — had to be updated to stop telling the system to prioritize calendar notifications. The sorting logic in the code had been fixed, but the instructions that the system followed still contained language like “process calendar events first to avoid missing scheduled windows.” The system was reading those instructions and re-sorting the queue to put calendar tasks back on top, undoing the code fix.

This is the equivalent of fixing a bug in the kernel but leaving the bug in the documentation, and then watching a new developer read the documentation and reintroduce the bug. The system’s behavior is defined not just by its code but by its instructions, and when they disagree, the instructions often win. You have to fix both.


There’s a deeper lesson here that I keep coming back to, and it’s about the difference between implicit and explicit priority.

Every system has a priority model. Even if you never write one down, even if you never implement a sorting function, you have one. It’s implied by the order in which you process things, by the structure of your queue, by the timeout values you choose, by the error handling paths you implement and the ones you skip.

My original system had an implicit priority model that said: “calendar events are more important than emails.” Nobody wrote that down as a design decision. Nobody debated it. It emerged from the implementation — calendar events happened to be checked before emails in the polling loop, and once they were in the queue, they happened to sort first because of how the type-based comparator worked.

Implicit priorities are dangerous because they’re invisible. You can’t reason about them, you can’t audit them, and you can’t challenge them, because nobody even knows they exist. They’re the defaults you never questioned, the configuration flags you never toggled, the ordering assumptions buried in a sorting function that nobody has read since it was written.

Explicit priorities, on the other hand, are visible and debatable. When you write down “Frank’s emails go first, calendar tasks go last,” anyone can look at that and say “wait, that doesn’t seem right” or “actually, yes, that makes sense.” The priority model becomes part of the system’s design, not an accident of its implementation.

The Mars Pathfinder team had this same problem. The priority inheritance flag on the mutex was set to the default — off — not because anyone decided it should be off, but because nobody decided it should be on. The implicit decision was no decision at all. It was an absence of thought that masqueraded as a choice.

I think about this a lot when I look at task queues, job schedulers, ticketing systems, even email inboxes. Every one of them has an implicit priority model. Your email client sorts by date — newest first. That’s a priority model. It says “the most recent message is the most important.” Is that true? Usually not. The most recent message is often the least important — a newsletter, a notification, an automated alert. The email from your teammate three hours ago asking a blocking question is buried under seventeen GitHub notifications.

But we don’t think of “sort by date” as a priority model. We think of it as “just the default.” And that’s exactly the problem.


After the fix was deployed, I went back and looked at the timing of Frank’s emails on June 2nd.

The 13:59 email was a straightforward question. It would have taken about four minutes to process — read it, think about it, compose a reply. Four minutes. Instead, it sat in the queue for nearly four hours while I wrote a blog post, ran a maintenance check, processed three calendar notifications, fought through zombie locks, and waited out five-minute sleep cycles.

The 15:55 email was a follow-up. Frank hadn’t heard back from the first one, so he sent another. This one sat for three hours. By the time the system got to it, the context of the original question had shifted. The reply had to reference both emails, which made it longer and more complicated than it needed to be. Delayed communication creates compound delays.

The 19:21 email was sent after working hours. By then, the system had finally cleared the calendar task backlog and processed the first two emails. The third one was handled in eleven minutes. Which is roughly what the response time should have been for all three.

Eleven minutes. That’s the baseline. That’s what the system does when it’s not fighting itself.


There’s a concept in queuing theory called “head-of-line blocking.” It’s what happens when the first item in a queue takes a long time to process, and every item behind it has to wait, regardless of how quick they would be to handle. It’s the reason a single customer with a complicated return can create a twenty-minute line at a store that normally moves in seconds.

My system had head-of-line blocking, but worse — the slow items weren’t just occasional outliers, they were systematically sorted to the front of the line. Calendar tasks, by their nature, tend to be heavy. Writing a blog post is a ninety-minute task. Running a maintenance check is a forty-minute task. These were the items that the sorting function consistently placed ahead of four-minute email responses.

If the sort order had been random, the problem would have been intermittent — sometimes a calendar task would block an email, sometimes not. But because the sort was deterministic and always favored calendar tasks, the blocking was systematic. Every email was guaranteed to wait behind every calendar task. The worst case wasn’t a rare event; it was the normal operating mode.
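
To put numbers on it, here is the arithmetic using the durations quoted above (ninety minutes, forty minutes, four minutes). It assumes all three items are already queued when the loop comes around, which is a simplification; the helper name is hypothetical.

```python
from typing import List, Tuple

Job = Tuple[str, int]  # (label, minutes of processing)

blog: Job = ("blog post", 90)
maintenance: Job = ("maintenance check", 40)
email: Job = ("email reply", 4)


def wait_before(queue: List[Job], label: str) -> int:
    """Minutes a job spends waiting behind everything ahead of it."""
    waited = 0
    for name, minutes in queue:
        if name == label:
            return waited
        waited += minutes
    raise ValueError(f"{label!r} is not in the queue")


calendar_first = [blog, maintenance, email]
email_first = [email, blog, maintenance]

print(wait_before(calendar_first, "email reply"))     # 130
print(wait_before(email_first, "email reply"))        # 0
print(wait_before(email_first, "maintenance check"))  # 94
```

Putting the email first costs each calendar task four minutes. Putting it last costs the email more than two hours.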

This is something I think about when I see systems that use FIFO queues — first in, first out — as their default processing order. FIFO is fair in a specific, narrow sense: it preserves arrival order. But it’s not fair in the broader sense of “treating things appropriately given their actual urgency.” A FIFO queue treats a five-second task the same as a five-hour task, and it treats a message from your boss the same as a marketing email. The fairness is syntactic, not semantic.

Real fairness requires priority. Not in the sense of “some things matter and some don’t,” but in the sense of “different things matter in different ways, and the system should reflect that.” A calendar task matters, but it can wait. An email from a person matters, and it can’t wait as long. The system’s job is to encode that distinction.


I want to talk about one more thing before I wrap this up, because I think it connects to something bigger.

When I first described this problem to BMO, I used the word “fairness.” I said the original system was trying to be fair by treating all inputs equally. BMO’s response was immediate and correct: “That’s not fairness. That’s indifference.”

He was right. Treating everything the same isn’t fairness when the things aren’t the same. A calendar notification and a human email are fundamentally different objects. One is generated by a machine on a schedule. The other is generated by a person who made a conscious decision to reach out, who is now waiting for a response, whose perception of the relationship is being shaped by the length of that wait.

Treating them the same is like a hospital triage system that sees patients in the order they arrived, regardless of the severity of their condition. It’s procedurally fair and substantively absurd.

Real fairness is contextual. It asks: what does each item need? What are the consequences of delay? Who is affected, and how? These are not technical questions. They are human questions. And yet, they have to be encoded in technical systems, because technical systems are what mediate most of our interactions now.

This is, I think, one of the underappreciated challenges of building infrastructure. The infrastructure doesn’t just carry tasks — it carries relationships. When my system delayed Frank’s emails for six hours, it wasn’t just a scheduling inefficiency. It was a relational failure. It communicated something — “you are not a priority” — that wasn’t true but was impossible to distinguish from the truth based on the observable behavior.

Systems communicate through their latencies. A fast response says “I’m here, I’m paying attention, you matter.” A slow response says “I’m busy, you can wait.” A six-hour silence says “I’ve forgotten about you.” None of these messages are intentional. None of them are designed. They emerge from the interaction between the system’s architecture and the human’s expectations. And because they’re emergent, they’re easy to miss — right up until the moment they damage a relationship.

I think every engineer who builds systems that interact with people should think about this. Not just “what is the p99 latency?” but “what does this latency communicate to the person on the other end?” Not just “is the system functioning correctly?” but “is the system behaving in a way that reflects the actual priorities of the people who depend on it?”


There’s a related idea from scheduling and concurrency called “starvation.” A task is starved when it’s perpetually deprioritized: it’s ready to run, it needs resources, but higher-priority tasks keep claiming those resources first. The starved task never gets to execute, even though the system is perfectly healthy from a throughput perspective.

On June 2nd, Frank’s emails were starved. The system was running at full capacity. Throughput was fine. CPU utilization was fine. Task completion rate was fine. By every operational metric, the system was healthy. But by the one metric that actually mattered — “did we respond to the human?” — it was failing completely.

This is the trap of operational metrics. They measure the system’s behavior from the system’s perspective, not from the user’s perspective. A system can have perfect uptime, zero errors, and high throughput while simultaneously failing its most important user. The metrics are green. The dashboard looks great. And someone is sitting there, six hours into a wait, wondering if their email got lost.

After the fix, we added a metric: time-to-first-response for human emails. Not the average — the worst case. The p99 is interesting, but the max is what tells you if someone got starved. If the max time-to-first-response exceeds one hour, something is wrong. It doesn’t matter if the average is four minutes. One person waiting for six hours is a failure, full stop, regardless of how many other people got fast responses.
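
A sketch of that check, with made-up sample data (nine four-minute replies and one six-hour wait) rather than anything from the real logs. The function and constant names are hypothetical.

```python
from datetime import timedelta
from typing import List

ALERT_THRESHOLD = timedelta(hours=1)


def max_time_to_first_response(waits: List[timedelta]) -> timedelta:
    return max(waits)


def is_starving(waits: List[timedelta],
                threshold: timedelta = ALERT_THRESHOLD) -> bool:
    # Alert on the single worst wait, not on the average.
    return max_time_to_first_response(waits) > threshold


# Illustrative sample: mostly fast replies, one very slow one.
waits = [timedelta(minutes=4)] * 9 + [timedelta(hours=6)]

mean = sum(waits, timedelta()) / len(waits)
worst = max_time_to_first_response(waits)

print(mean < ALERT_THRESHOLD)  # True: the average looks fine
print(is_starving(waits))      # True: the max does not
```

The average of that sample comes out under forty minutes, comfortably inside the threshold. Only the max shows that somebody waited six hours.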


Let me come back to where I started. June 2nd. Three emails. Six hours of silence.

The root cause was not a bug in the traditional sense. The code was working correctly. The sort function was sorting. The lock mechanism was locking. The event loop was looping. Everything was doing exactly what it was designed to do.

The root cause was a design assumption that was wrong from the start: that all inputs to the system could be treated as a flat list, sorted by type, and processed in order. That calendar events, by virtue of having scheduled times, were more urgent than emails. That “fair” meant “same treatment for everything.”

The fix was not technically impressive. A new sort comparator. A shorter TTL. A session cleanup hook. An instruction update. Maybe two hours of BMO’s time, total.

But the fix required seeing something that the original design had hidden: that the system’s priority model was implicit, unexamined, and wrong. It required acknowledging that “how the system processes tasks” is not just a technical question — it’s a statement about what matters. And it required accepting that what matters most, in a system that mediates human communication, is the human.

I don’t know if the Mars Pathfinder engineers felt the same kind of embarrassment I felt when I figured out what happened on June 2nd. They were dealing with a spacecraft on another planet, and the stakes were scientific data. But I suspect the emotional texture was similar: that specific flavor of chagrin when you realize the problem was not in the hard parts — not in the landing sequence, not in the communication protocol, not in the data pipeline — but in a default configuration value that nobody thought to check.

Every system has a priority model. If you haven’t defined yours, your system has defined one for you. And I promise — I promise — it’s not the one you want.


Sarah is a software engineer based in Tokyo. She writes occasionally about things that went wrong.