Do you enjoy reading tales of engineers’ quests to track down bugs with obscure causes? Here's my recent adventure:
Background: For the past few months, @DestinyTheGame has been using some Steam peer-to-peer networking tech I’ve been working on.
The tech is software-based routing: route through general-compute linux boxes using custom protocol, with clients exercising significant routing autonomy. Since peers do not share their IP addresses, script kiddies cannot DDoS other players.
https://partner.steamgames.com/doc/api/ISteamNetworkingSockets
Since it launched, the disconnection rate ( #beaver errors) was higher than expected. Some players would get it often; others never saw it. Restarting the game would often make it go away. We could never reproduce it. (It works on my machine!)
Each connection involves 4 hosts: 2 clients (that we cannot access) and 2 relays. We dug through countless examples, correlating Bungie’s records, session summaries in our backend, and logs from the 2 relays. Forensic debugging of problems you cannot reproduce is frustrating.
Furthermore, it’s difficult to tell if a specific example is even a bug with packets getting lost in our system; it might just be a “normal” Internet problem.
Over a month or two, I fixed some minor bugs, and Bungie fixed some issues with the API integration. Each time, we thought this might be the underlying cause, but nothing moved the needle.
Towards the end, I was spending entire days playing Destiny with tons of extra logging.
(The good news is that I got some great bonding time with my boys.
I highly recommend having enough children to fill out a fire team.
In a few years, my youngest will be old enough to play with us, at which point I might be brave enough to try gambit.)
The breakthrough came while looking at an example involving two relays in Virginia. One relay had our experimental XDP path enabled.
XDP is Linux technology that enables you to bypass the kernel and receive Ethernet frames directly in userspace.
https://en.wikipedia.org/wiki/Express_Data_Path
The kernel does a heroic amount of work for you to deliver a single UDP packet, most of which is bypassed by XDP. So XDP is *insanely* fast. In our case, the XDP code can process 5x-10x the packets for the same CPU cost as the plain BSD socket code.
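For a sense of what the kernel-side half of this looks like, here is a minimal XDP program (an illustrative sketch, not our production code) that hands frames arriving on an RX queue straight to an AF_XDP socket in userspace, and falls back to the normal kernel stack when no socket is bound:
```c
// Minimal XDP redirect program (illustrative sketch, not our production code).
// Frames arriving on an RX queue that has an AF_XDP socket registered in
// xsks_map go straight to userspace; everything else takes the normal
// kernel path (XDP_PASS).
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_redirect(struct xdp_md *ctx)
{
    // If no AF_XDP socket is bound to this RX queue, fall back to XDP_PASS.
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```
The userspace side then reads raw Ethernet frames out of shared-memory rings, which is where both the speedup and the footguns below come from.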
Unfortunately, we cannot deploy XDP everywhere, due to a bug in the Intel NIC driver.
We’re using a “Dell” Intel NIC, so @intel told us to take it up with them, and Dell can’t help us.
This relay had an AMD processor and Mellanox NIC, so we could enable XDP.
So this relay was an “oddball” in the fleet, and for it to be implicated was suspicious.
I ran a query on our connection analytics to see if it was an outlier.
YES! One host in Virginia had a significantly higher rate of disconnections than the othe… WAITASECOND.
The outlier was not the relay using XDP, it was the *other* one.
WAT.
OK, so here's the bug:
With XDP, you are serializing raw Ethernet frames, including the next hop MAC address. If the final destination is local, the next hop is that host’s MAC, otherwise it’s a switch.
(Figuring out that next hop is part of the kernel’s “heroic work” I mentioned earlier.)
Our XDP code is *not* heroic. It assumes that all traffic goes through the same switch.
To forward a packet, we rewrite the header (importantly, the destination IP address) and swap the source/destination MAC addresses, sending it back to the switch, who knows what to do next.
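Here is a hypothetical sketch of that rewrite (not the actual relay code; assumes IPv4 and omits the checksum fix-up):
```c
// Hypothetical sketch of the forwarding rewrite described above
// (not the actual relay code). Assumes 'frame' points at a contiguous
// Ethernet frame carrying an IPv4 packet; checksum updates omitted.
#include <stdint.h>
#include <string.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

static void forward_rewrite(uint8_t *frame, uint32_t client_ip_be)
{
    struct ethhdr *eth = (struct ethhdr *)frame;
    struct iphdr  *ip  = (struct iphdr *)(frame + sizeof(struct ethhdr));

    // Swap source/destination MACs. This bakes in the topology assumption:
    // the frame is presumed to have arrived via the switch, so swapping
    // makes the switch the next hop for the outgoing frame.
    uint8_t tmp[ETH_ALEN];
    memcpy(tmp, eth->h_dest, ETH_ALEN);
    memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
    memcpy(eth->h_source, tmp, ETH_ALEN);

    // Rewrite the destination IP to the final destination (the client).
    ip->daddr = client_ip_be;
}
```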
This topology assumption was violated in Virginia. One relay was on the same subnet as the XDP relay, so the source MAC on packets arriving from it was that relay’s, not a switch’s.
XDP code gets a client-bound packet, fills in the client’s IP in the IP header, and swaps the source/dest MAC addresses…
…..which sends it back to the relay whence it came (where it is then dropped).
This was the cause of the disconnections in Virginia. If two peers selected these two relays, they could not communicate.
Once I knew to look for this type of outlier, I found three other examples exhibiting similar behaviour (but for less interesting technical reasons).
Bungie confirmed that the disconnection rate is now back to expected levels. https://www.bungie.net/en/Explore/Detail/News/49387
Why wasn’t the XDP relay an outlier? Because it wasn’t reporting those disconnected sessions. We don’t log stats for “uninteresting” sessions (sessions used only briefly or not at all).
A disconnection is usually considered “interesting”, no matter how brief.
But a bug triggered by the asymmetric nature of the loss broke this: a function call had a bool and an int parameter swapped, and gcc and MSVC both performed the implicit conversions without complaint!
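Roughly this shape of pitfall, as an illustration (hypothetical helper, not the actual code):
```c
// Illustration of the swapped bool/int argument pitfall (hypothetical code).
#include <stdbool.h>
#include <stdio.h>

// Hypothetical helper: decides whether a session's stats are worth logging.
static void maybe_log_session(bool interesting, int packets_lost)
{
    if (interesting)
        printf("logging session (%d packets lost)\n", packets_lost);
}

int main(void)
{
    bool had_disconnect = true;  // the session did end in a disconnection...
    int  packets_lost   = 0;     // ...but this side never saw any loss

    // Bug: arguments in the wrong order. The int silently converts to bool
    // (0 -> false) and the bool to int, so gcc and MSVC both accept this,
    // and the disconnected session is classified as uninteresting.
    maybe_log_session(packets_lost, had_disconnect);
    return 0;
}
```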
Why did it take so long to find/fix?
Because we were myopic, looking for a software bug. Each time we found something, we thought “this is it!” We *were* finding real problems; they were just very rare in practice.
Also: networking is complicated.
Why didn’t monitoring catch this?
We do have significant monitoring, especially for problems *between* data centers. But it did not detect problems between hosts in the same data center.
Also: networking is complicated.