One of my fundamental beliefs about debugging is that you have to be prepared. If you are new to a company/environment/technology, you need to talk to folks and build up the set of tools and techniques that apply. #ToggleTalk
Don't wait until you are paged out for an incident and discover that gdb/strace/tcpdump aren't sufficient to figure out what's going on. #ToggleTalk
Having a space to share "how I figured this out in my dev|test environment" stories can help build up the set of useful (and not so useful) tools. This means having enough time for those conversations, and the psychological safety to talk about what is not known. #ToggleTalk
One of my most satisfying debugging stories comes from a distributed system. We kept seeing exponential stampedes in production that development kept telling us weren't possible. The problem was so impactful that cleanup was taking all possible debug time. #ToggleTalk
JavaScript and D3 visualizations were super useful in hacking up dashboards to tell the story of what was happening. At that point we had some form of the what, but not the why. #ToggleTalk
Next, I worked on leveling up everyone on the team to be able to recognize the symptoms so that we could minimize the impact of the stampede. This meant coordinating response and NOT immediately doing all the work myself. #ToggleTalk
I worked hard on reframing the conversations we were having as well. Instead of focusing on cause ("the customer is doing this"), we focused on behavior ("the system behaves this way when this happens") and on which parts of the system we could change, beyond how we respond. #ToggleTalk
The combination of these things helped reduce the friction we were having with customers as well as educate the whole team to respond to the customer impact while giving me the mental space to start tackling the debugging of the system problem. #ToggleTalk
A feature pushed on a critical timeline earlier in the year leveraged existing functionality in a way that caused duplicate messages to be sent out. For every message to itself that it should have ignored, it sent a duplicate instead. Reading the code alongside the tcpdump captures was what it took to figure it out. #ToggleTalk
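To make the "exponential" part concrete, here is a toy sketch, not the real system. It assumes each self-addressed message is re-sent twice per round instead of being dropped; the function name, the `ignore_self` flag, and the doubling factor are all hypothetical and chosen only for illustration.

```python
def messages_in_flight(rounds: int, ignore_self: bool, initial: int = 1) -> list[int]:
    """Toy model: count self-addressed messages in flight after each round."""
    counts = []
    in_flight = initial
    for _ in range(rounds):
        if ignore_self:
            # Intended behavior: a self-addressed message is simply dropped.
            in_flight = 0
        else:
            # Assumed buggy behavior: each message is duplicated instead of dropped.
            in_flight *= 2
        counts.append(in_flight)
    return counts


print(messages_in_flight(5, ignore_self=True))   # [0, 0, 0, 0, 0]
print(messages_in_flight(5, ignore_self=False))  # [2, 4, 8, 16, 32]
```

Even in this tiny model the buggy path doubles every round, which is the shape that shows up as a stampede on a dashboard.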
So the "what to use" isn't super interesting, but we had looked at TONS of tcpdumps prior to this. Having the space/time to do so while not also dealing with the intricacies of high-impact incidents was critical to figuring out what happened. #ToggleTalk
You can follow @sigje.