In honor of someone’s bad bug today, I will retell a story of my worst bug:

Once upon a time I was the CEO and entire engineering team of a company which sent appointment reminders.

Each reminder was sent by a cron job draining a queue. That queue was filled by another cron job.
Reminders could fail, but the queue-draining job had always been bulletproof: it had never failed to execute or taken more than a few seconds to complete. It ran every 5 minutes.

So I had never noticed the queue *filling* job wasn’t idempotent.
Idempotent is a $10 word for a simple concept: an operation where you get the same result no matter how many times you run it.

Adding 2 + 2 is idempotent. Creating a new record in your database may not be; the number of rows in the DB goes up each time.
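A minimal sketch of the distinction, with hypothetical names (not the company's actual code):

```python
# Hypothetical in-memory examples of idempotent vs. non-idempotent operations.

def set_status(record, value):
    """Idempotent: running it once or ten times leaves the same state."""
    record["status"] = value
    return record

def append_row(table, row):
    """Not idempotent: each call grows the table by one more row."""
    table.append(row)
    return table

record = {"status": "new"}
set_status(record, "done")
set_status(record, "done")  # second call changes nothing

table = []
append_row(table, {"id": 1})
append_row(table, {"id": 1})  # second call adds a second row
```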
One day, for the first time ever, the queue draining job broke and could not be restarted. This was a result of a trivial code push I had made to an unrelated part of the code base late in the day, prior to an apartment move.

Of course, being responsible, I was paged immediately.
However, that was back during those pre-iPhone years where my cell phone was “a useful tool” rather than “extension of brain”, and like many useful tools it ended up in a box on the moving truck, trilling merrily for 13 hours.
Later that night, while unboxing things, I got the page, realized there were thousands of undrained events in the queue, and panicked. So I reverted the bugged deploy and restarted the queue workers. Queue quickly drained.

Crisis averted, right?
At 2 AM I woke from a nightmare caused by systems-engineer spidey sense. “Wait wait wait. There were THOUSANDS of events on the queue? Shouldn’t it have been a couple dozen at that hour?”

And then I realized with dawning horror what I had done.
In the 13 hours the queue had been broken, cronjob #2 had been dutifully asking “Have we called Client of Customer #437 about their Friday appointment?”, gotten a no from the DB, and then dutifully queued up a call.

Every five minutes.

Resulting in 13 * (60 / 5) = 156 queued calls.
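The failure mode above can be sketched like this, with hypothetical names: the filler checks whether the reminder was *sent*, but not whether one is already *queued*, so while the drainer is down, each 5-minute run re-enqueues the same call.

```python
# Hypothetical sketch of the non-idempotent queue-filling job.
# It asks "have we called them?" but never "is a call already queued?"

queue = []
sent = set()  # appointment ids whose reminders actually went out

appointments = ["client-437-friday"]

def fill_queue():
    for appt in appointments:
        if appt not in sent:    # "Have we called them?" -> no
            queue.append(appt)  # ...so dutifully queue (another) call

# Cron fires every 5 minutes; the drainer is down for 13 hours:
for _ in range(13 * (60 // 5)):
    fill_queue()

print(len(queue))  # 156 duplicate calls for a single appointment
```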
We did not spam the heck out of one client’s inbox. Oh no.

We spammed the heck out of every client, of every customer, who had an appointment that day.

But it was worse than that, because a key feature of our service was that it didn’t just email; it would escalate to SMS and then to a phone call.
And since the blissfully ignorant queued calls each thought “OMG I am so late, better urgently tell the person about their appointment,” most of them chose to escalate to a call.

Now there is a word for what happens when many independent systems simultaneously try to restart.
We call it a “thundering herd.” It routinely brings down systems built for massive scalability like web tiers, APIs, databases, etc.

You know what is not designed for massive scalability? A residential plain old telephone.
Plausibly you might not even know what happens if, say, 50 people all call you at once. I’ll tell you: the others keep ringing indefinitely while you have a conversation with the first caller, and when you hang up, your phone immediately starts ringing again.

You can repro w/ 50 patient friends.
Computers are very, very patient, and so they dutifully lined up and let the phone ring until it was answered or went to answering machine, 50 times consecutively.

Or, as more commonly happened, the immensely frustrated person physically disconnected their phone line.
Now this would have been bad enough. But what did those 50 calls start with?

“This is Dr. Smith’s office. He’d like to remind you that...”

So you can imagine who got the call an hour later. From every patient they were supposed to see that day.

Then they sent me an email.
And so it was that at 2 AM local time I had a long stack of very irate emails from literally every client of my company and no working internet to answer them from (new apartment).
I still had the keys to the old apartment, which had working internet. Problem: it was across town, with no way to call a taxi, because small-town Japan and no functioning phone.

Solution: pack up laptop, landline, and heater into backpack, walk across town to old apartment. In freezing rain.
And so at 3 AM, wet and shivering in a room with no light, I started making apology calls.

After the first two I broke down and called my dad, convinced I had just bankrupted my company. He talked me down.

I worked through the night on apologies.
We ended up losing exactly two accounts. One reactivated the next day, impressed that they rated “a personal apology from the CEO.”

“Everyone makes mistakes. Go easy on your engineer.”

That’s good advice.
I fixed the idempotency issue and added MANY safeguards. We never unintentionally duplicated a call again for the life of the company.
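One common safeguard for this class of bug is a deduplication key on enqueue, so re-running the filler is harmless. This is an assumption about one possible fix, not the author’s actual code:

```python
# Hypothetical safeguard sketch: a dedup key makes enqueueing idempotent.
# (One possible fix; not necessarily what the author implemented.)

queue = []
enqueued = set()  # keys of reminders already placed on the queue

def enqueue_once(appointment_id, kind):
    key = (appointment_id, kind)  # e.g. ("client-437-friday", "call")
    if key in enqueued:
        return False              # already queued; do nothing
    enqueued.add(key)
    queue.append(key)
    return True

# Cron can now fire as often as it likes:
for _ in range(156):
    enqueue_once("client-437-friday", "call")

print(len(queue))  # exactly 1
```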

Nobody remembers this now except me, and engineers who I tell it to, when they think that they’ve just made the worst mistake of their career.
You can follow @patio11.