So the first version of Etsy (2005) was PHP talking to Postgres, written by someone learning PHP as he was doing it, which was all fine. Postgres was kind of an idiosyncratic choice for the time, but it didn’t matter yet and anyway that’s a different story.
I started in the long shadow of “v2,” a catastrophic attempt to rewrite it, adding a Python middle layer. They had asked a Twisted consultant what to do, and the Twisted consultant said they needed a Twisted middle layer (go figure).
The initial release of this resulted in two full weeks of downtime, and an infamous incident where one of the investors had to drive to Secaucus to physically remove the other engineering founder from the cage.
He had been there for days I guess and was threatening to drop the master database, for which there were no backups. Nobody learned anything from this.
Anyway, people were understandably really mad at this middle layer, but nobody agreed on the reason to be mad. And nobody was in charge, so there were several concurrent attempts to destroy it in competition with each other.
The team thinking it would rewrite the whole site in Django morphed into a team thinking it would rewrite Django first, and then rewrite the site in the result. There was one guy trying to rewrite the site in Java.
Either of these scenarios may have been worse than what transpired, which was that the team trying to rewrite the middle layer with a drop-in replacement won the political struggle and ate all of the other teams.
The theory of the drop-in team was that the existing middle layer was bad because although it used Twisted, it still used a threadpool. (Twisted is a reactor loop like nodejs, but in Python.) For Twisted acolytes, this situation was heretical.
So the drop-in team was trying to recreate the same middle layer, but using the reactor loop faithfully. And the theory was that’d solve all of the problems with the existing middle layer bringing the site down constantly.
The middle layer didn’t really do anything. It was consultant-provided speed holes which (it was believed) would make things faster by (counterintuitively) adding network hops.
The middle layer just received RPC and invoked postgres stored procedures. Which, if you have a superficial understanding of things, seems like the kind of boilerplate you can replace with an abstraction.
So the drop-in team wrote a declarative framework for making these RPC endpoints. Then they proceeded to discover that here in reality every single existing endpoint did things differently. https://twitter.com/kellan/status/1194633772253106176
To ensure bug-for-bug compatibility, they made me and several other people write detailed a detailed test suite for six months. As predicted in Kellan’s tweet above, the job of writing the server was separated from the dirty work of implementing the “business logic.”
“Business logic” being a term of art which means “your bullshit”, in contrast to “my code,” which is beautiful.
(By the way this was all SO much better than the financial industry job I had before this.)
When we finished, what we had was a Twisted-pure version of the middle layer. Plus a declarative framework which just added lines of code, since declaratively specifying a thousand special cases requires more code than not having the framework at all.
This is a generalizable part of the experience for me—if I could choose a superpower it’d be to appear like Candyman behind any developer saying “it should be easy to make a declarative framework for this.”
The drop-in was also written in callback style with Deferreds, so although logically it was identical to the first version, it was roughly triple the line count and much harder to grok and debug.
If you forgot to return a Deferred, you were shit out of luck since the language obviously couldn’t help you with it.
But eventually after months and months we released this thing on one page, and it burst into flames within milliseconds.
I don’t know what the state of the art is with debugging and profiling nodejs these days. But whatever that story is, I assure you that understanding what a Python reactor loop was doing while it was melting down in 2007 was the bronze age by comparison.
I saved this screenshot of helplessly waving kcachegrind at it and hoping for a miracle
It was at this point that the Twisted consultants were brought back
They said that although Twisted was good at overall throughput, outlying requests could experience severe latency. Which was a problem for the drop-in, because the way the PHP frontend used it was hundreds/thousands of times per web request.
“Yeah sorry folks it’s not good for this.”
So over the course of a few weeks we frantically rewrote the drop-in replacement to use a threadpool instead, exactly like the original heretical one.
Leaving us with literally the same code as the thing it was dropped in to replace, plus a ridiculous declarative framework, plus some tests. It was around this time that everyone got fired (but not me).
One way to look at this is that an entire engineering and ops team lost their jobs because a group of people thought that threads were Wrong.
By the time the drop-in replacement was being systematically eradicated by the drop-in replacement engineering team, this entire saga had been forgotten because it was simply too out there to be believable.
As a younger person I had no power to avert any of this, but I managed to not get fired because through this whole thing I was talking shit about it. Which was not necessarily the lesson I needed to walk away with, but here we are. /end
You can follow @mcfunley.
Tip: mention @twtextapp on a Twitter thread with the keyword “unroll” to get a link to it.

Latest Threads Unrolled: