The most common feature-flagging pain I hear about is "feature flag debt" - stale flags clogging up your codebase with conditionals and dead code.

Uber just open-sourced Piranha, an internal tool for detecting and cleaning up stale feature flags.

Let's talk about it a bit...
Besides being an interesting approach to a very common problem, their discussion of Piranha also provides some fascinating insights into an organization that's *heavily* invested in feature flagging...
And when I say heavily invested, I mean it. The paper describes 6,601 feature flags in their mobile codebases alone. Granted, that's apparently across 7.7 *million* lines of code, but it's still a lot of feature flags.
As one would expect given that count, Uber use flags for a lot of purposes: experimentation, controlled rollout, safety, and sometimes long-term operational things like kill switches and testing in prod (more on that in a moment).
They seem to be big fans of managing features by geo/market - turning on a feature for a specific market before rolling it out more broadly.

I've seen this approach at other orgs that are naturally segmented by market. Makes it easier to understand the scope of a rollout.
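
To make that concrete, here's a minimal sketch of what a market-gated flag check tends to look like. The names are made up for illustration - this is not Uber's actual flagging API:

```java
// Hypothetical sketch of a market-gated flag check - illustrative
// names, not Uber's actual flagging API.
import java.util.Set;

public class MarketGatedFlags {
    // Markets where the new flow is enabled so far.
    private static final Set<String> ENABLED_MARKETS = Set.of("SFO", "NYC");

    public static boolean newPricingFlowEnabled(String riderMarket) {
        // Gate by market first; a broader percentage rollout could
        // follow once these markets look healthy.
        return ENABLED_MARKETS.contains(riderMarket);
    }
}
```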
Uber's flagging system has the ability for flags to be more than just on or off - a flag state can be an integer or a double, for example. However, these "parameter flags" comprise only a "very small set of flags" - the rest are simple on/off booleans.
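
Here's a rough sketch of that distinction, using a hypothetical flag-client API (the method names are mine, not Uber's):

```java
// Sketch of boolean flags vs "parameter flags" whose value is a
// number - hypothetical API, not Uber's.
public interface FlagClient {
    boolean isEnabled(String flagName);               // plain on/off flag
    long getLongParameter(String flagName, long defaultValue);
    double getDoubleParameter(String flagName, double defaultValue);
}

class Example {
    void render(FlagClient flags) {
        if (flags.isEnabled("new_map_renderer")) {    // typical boolean flag
            // Parameter flag: tune behaviour without a code change.
            long tileCacheMb = flags.getLongParameter("map_tile_cache_mb", 64);
            // ... use tileCacheMb ...
        }
    }
}
```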
All these flags come with a cost. They make Uber's code harder to understand, and slow down build+test.

They also introduce the risk of accidentally flipping a stale flag in prod, with untested and scary consequences (cf. the Knight Capital Knightmare) https://www.bugsnag.com/blog/bug-day-460m-loss
Interestingly, the paper *doesn't* mention a cost that I often hear people concerned about - the combinatorial challenge of testing a large number of flags. I suspect this is because Uber have been feature flagging long enough to have developed good practices to mitigate this.
So, too many flags are bad - Uber should remove stale flags. However, identifying inactive flags is a surprisingly tricky problem. Some are "kill switches" which are almost never used, but need to be left available in prod. Others are in place for debugging or testing in prod.
As an aside, I think a flag categorization scheme would help here, making it easier to distinguish long-lived feature flags like these from short-lived Release or Experiment Toggles.

See https://www.martinfowler.com/articles/feature-toggles.html#CategoriesOfToggles for more discussion of this idea.
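
To illustrate, here's a toy sketch of what such a scheme might look like, loosely following the toggle categories from that article. This is a hypothetical API, not something from the Piranha paper:

```java
// Minimal sketch of flag categorization, loosely following the toggle
// categories in the Fowler article linked above. Hypothetical API.
enum FlagCategory {
    RELEASE,     // short-lived: delete once the feature ships
    EXPERIMENT,  // short-lived: delete once the experiment concludes
    OPS          // long-lived: kill switches, testing-in-prod hooks
}

record FlagDefinition(String name, FlagCategory category, String ownerTeam) {}

class FlagRegistry {
    // Only short-lived categories should ever show up in a
    // "stale flag" cleanup report.
    static boolean eligibleForCleanup(FlagDefinition flag) {
        return flag.category() != FlagCategory.OPS;
    }
}
```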
The paper also discusses the idea of requiring an expiration date when a flag is initially defined. This is another idea I've advocated for, and seen a few teams have some success with. Not a panacea though.
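
Here's a sketch of what a required expiration date might look like in practice - again a hypothetical API, not something the paper says Uber built:

```java
// Sketch of declaring a flag with a required expiration date, plus a
// check that could run in CI to surface overdue cleanups.
import java.time.LocalDate;

record ExpiringFlag(String name, String owner, LocalDate expiresOn) {
    boolean isExpired() {
        return LocalDate.now().isAfter(expiresOn);
    }
}

class FlagExpiryCheck {
    // What "enforce" means is a policy choice: fail the build,
    // file a ticket, or just ping the owner.
    static void enforce(ExpiringFlag flag) {
        if (flag.isExpired()) {
            throw new IllegalStateException(
                "Flag '" + flag.name() + "' expired on " + flag.expiresOn()
                + " - owner " + flag.owner() + " should remove it.");
        }
    }
}
```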
Lack of flag ownership looks to be a considerable problem at Uber. I suspect that's partly due to the nature of their org - hyper-growth, lots of churn and movement of folks. They also focus on ownership of flags by an individual dev, rather than a team. That seems questionable to me.
They also discuss a lack of aligned incentives when it comes to cleaning up feature flag debt - working on new features is rewarded more. I was very interested to hear that iOS devs have a more concrete incentive to remove stale flags - a limit from Apple on app download size.
Uber have used focused debt paydown efforts (e.g. "fixit weeks") to try and keep flag debt in check. I've heard of other orgs doing this too. It's been described to me as "the best tool we've found, but not a great tool" (paraphrasing a little here).
The stats in the paper about cleaned-up flags are interesting. It's not uncommon for a flag to be associated with over 100 lines of code, but flag-associated code tends to be restricted to a small number of files - 80% of flag-removal diffs involve 5 files or fewer.
This runs counter to the concerns I've heard from some folks getting started with feature flags that each new feature flag could lead to a bunch of conditionals strewn around a codebase.
Uber have added custom extensions to their automated testing frameworks to make it easier to declaratively manage flag state in tests that are sensitive to that state - something I often see at orgs using flags heavily, but not something I've seen discussed much. 🤔
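
Something like this, roughly - a hypothetical declarative annotation for pinning flag state in a test (the annotation and runner wiring are made up, not Uber's actual extensions):

```java
// Sketch of a declarative test-framework extension for pinning flag
// state in flag-sensitive tests. Hypothetical annotation and wiring.
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

@Retention(RetentionPolicy.RUNTIME)
@interface WithFlags {
    String[] enabled() default {};
    String[] disabled() default {};
}

class CheckoutTest {
    @WithFlags(enabled = {"new_pricing_flow"}, disabled = {"legacy_receipts"})
    void rendersNewPriceBreakdown() {
        // A custom test runner/extension would read the annotation and
        // force the flag client into this state before the test body runs.
    }
}
```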
That wraps up what I found most interesting in this paper. Kudos to Uber for discussing this stuff so openly! 🙌

What other nuggets did *you* get from the paper?

Anyone have any other interesting reports from orgs talking about their usage of feature flagging in the wild?