Solana halt today: what happened?

A bug relating to durable nonces was triggered on a majority of validators, leading to a block hash mismatch & validators stalling.

This was a known bug that was already being fixed; it just hadn't been triggered in this form before.
The issue is completely unrelated to the ongoing clock drift, which had nothing to do with the network halting.

As soon as validators observed that the network had stopped producing blocks, they began confirming with each other on Discord that everyone was seeing the same thing...
Once this was confirmed, the process of identifying the restart slot began; at that point the root cause was not yet known, nor whether a patch would be needed before restarting, but choosing the slot doesn't depend on either of those things.

Slot 135986379 was chosen, as it was quickly ascertained to be the highest optimistically confirmed slot observed.

From here validators began generating snapshots from their ledgers to confirm the expected bank hash and shred version.
This too went quite quickly (snapshot creation takes 30-40 minutes), and almost universally validators arrived at the same bank hash and shred version, indicating easy consensus.
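Purely as an illustration of what that confirmation step amounts to, here is a minimal sketch in TypeScript; the types and field names are invented for this example and are not real Solana tooling:

```typescript
// Illustrative only: each operator creates a snapshot at the restart slot and
// everyone checks that the resulting bank hash and shred version match before
// proceeding with the restart.

interface SnapshotReport {
  validator: string;     // operator name (invented for illustration)
  bankHash: string;      // bank hash produced when creating the snapshot
  shredVersion: number;  // shred version derived from the snapshot
}

// Returns the (bankHash, shredVersion) pair shared by all reports, or null
// if any operator produced a different result and consensus isn't there yet.
function agreedRestartValues(
  reports: SnapshotReport[],
): { bankHash: string; shredVersion: number } | null {
  if (reports.length === 0) return null;
  const { bankHash, shredVersion } = reports[0];
  const allMatch = reports.every(
    (r) => r.bankHash === bankHash && r.shredVersion === shredVersion,
  );
  return allMatch ? { bankHash, shredVersion } : null;
}
```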

At this point the need for a patch had become clear.
We had already observed an instance of this error on our validator last Saturday and reported it; at that time it didn't affect the broader cluster.

From our logs we could quickly tell this was the same bug. Solana Labs engineers confirmed this from their nodes as well.
A durable nonce is a mechanism that lets a transaction be signed offline ahead of time, without requiring a recent block hash (which expires after around two minutes). Usage has recently increased, particularly by exchanges, possibly due to their cold storage setups.
The bug meant that a durable nonce transaction created and submitted in the same slot could be executed a second time in a later slot.
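For illustration, here is a minimal sketch of how a durable nonce transaction is built with @solana/web3.js; the payer keypair, devnet endpoint and transfer amount are assumptions for the example, not details from the incident:

```typescript
import {
  Connection,
  Keypair,
  LAMPORTS_PER_SOL,
  NONCE_ACCOUNT_LENGTH,
  NonceAccount,
  SystemProgram,
  Transaction,
  clusterApiUrl,
  sendAndConfirmTransaction,
} from "@solana/web3.js";

// Assumed setup: a funded payer keypair (also acting as the nonce authority).
const connection = new Connection(clusterApiUrl("devnet"), "confirmed");
const payer = Keypair.generate();        // assumed to already be funded
const nonceKeypair = Keypair.generate(); // will hold the durable nonce

async function durableNonceDemo() {
  // 1. Create and fund the nonce account on chain.
  const rent = await connection.getMinimumBalanceForRentExemption(
    NONCE_ACCOUNT_LENGTH,
  );
  const createTx = SystemProgram.createNonceAccount({
    fromPubkey: payer.publicKey,
    noncePubkey: nonceKeypair.publicKey,
    authorizedPubkey: payer.publicKey,
    lamports: rent,
  });
  await sendAndConfirmTransaction(connection, createTx, [payer, nonceKeypair]);

  // 2. Read the stored nonce value; it stands in for a recent blockhash.
  const info = await connection.getAccountInfo(nonceKeypair.publicKey);
  const nonce = NonceAccount.fromAccountData(info!.data).nonce;

  // 3. Build and sign the transaction offline-style: the first instruction
  //    must advance the nonce, and the nonce value is used as the blockhash.
  const tx = new Transaction();
  tx.add(
    SystemProgram.nonceAdvance({
      noncePubkey: nonceKeypair.publicKey,
      authorizedPubkey: payer.publicKey,
    }),
    SystemProgram.transfer({
      fromPubkey: payer.publicKey,
      toPubkey: Keypair.generate().publicKey,
      lamports: 0.01 * LAMPORTS_PER_SOL,
    }),
  );
  tx.recentBlockhash = nonce;
  tx.feePayer = payer.publicKey;
  tx.sign(payer); // signing can happen offline, long before submission

  // 4. Submit the pre-signed bytes whenever convenient; because the blockhash
  //    is the durable nonce, the transaction does not expire after ~2 minutes.
  const signature = await connection.sendRawTransaction(tx.serialize());
  console.log("sent durable nonce transaction:", signature);
}

durableNonceDemo().catch(console.error);
```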

Validators then notice this, and an error is triggered indicating the transaction was "AlreadyProcessed".
This is a good thing: it means the transaction wasn't run twice. But the question remains how it got this far into the replay stage, and why validators can't recover from this error.
While we don't have all the answers to that yet, the immediate solution was clear: disable durable nonces to ensure the network can restart, with a feature gate to re-enable them in future once it's safe to do so.
Core engineers rapidly deployed a patch, which validators took up: either 1.9.28 for those on the 1.9 branch or 1.10.23 for those on the 1.10 branch.

At this point snapshots had been created, restart instructions had been agreed on, and validators were ready to restart.
During a restart, validators start up, load the ledger, then "hold position" at the specified restart slot until 80% of stake is ready.
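To make the "hold position" step concrete, here is a purely illustrative sketch of the 80%-of-stake check; the types, names and structure are invented for this example and are not the actual validator implementation:

```typescript
// Illustrative sketch of the "wait for supermajority" step during a
// coordinated restart: each validator watches gossip and only proceeds once
// enough stake is visible at the agreed slot and shred version.

interface GossipPeer {
  pubkey: string;          // validator identity
  activatedStake: number;  // stake in lamports
  restartSlot: number;     // slot the peer reports it is holding at
  shredVersion: number;    // must match the agreed shred version
}

const RESTART_SLOT = 135986379; // the agreed restart slot from this incident
const STAKE_THRESHOLD = 0.8;    // 80% of total stake must be visible

// Returns true once enough stake is visible in gossip at the agreed slot and
// shred version, i.e. the leader for that slot may produce the first block.
function supermajorityReady(
  peers: GossipPeer[],
  totalActivatedStake: number,
  expectedShredVersion: number,
): boolean {
  const visibleStake = peers
    .filter(
      (p) =>
        p.restartSlot === RESTART_SLOT &&
        p.shredVersion === expectedShredVersion,
    )
    .reduce((sum, p) => sum + p.activatedStake, 0);

  return visibleStake / totalActivatedStake >= STAKE_THRESHOLD;
}
```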

As soon as they see 80% of stake active in gossip, the leader at that slot can produce the first block, and the network is immediately operating again.
In total the downtime lasted approximately 4h30m, and was resolved at around 21:00 UTC.

The clock drift was not resolved by this restart, but the upgraded versions contain a feature that can be enabled at a future epoch boundary to jump the clock to the correct time.