Refactoring the server's guts. It used to reimplement the game in C++, causing desyncs on any tiny divergence from the client's ASM. Now it embeds a 6502 emulator to run the game's routines. Two days of work, and it already passes my trickiest tests.

And... programming multi-Xeon servers with a 6502 😎
I am running into performance issues ... What a surprise!

Using the emulator has gone a long way toward crushing bugs. I'll try to optimize it; I really don't want to go back to the "reimplement the game" solution.
Ok let's make a thread for this performance quest!

The starting point is the mos6502.cpp emulator. It is simple enough for a quick start, and easily hackable. It is, however, not performance-oriented.
Two pain points with the emulated tick, compared to reimplementing it in C++:

Game state size has grown from ~150 bytes to 2 KB. We copy it a lot.

The emulation itself, instead of running native code.

Profiling shows that copying is not the problem, so the focus is on emulation speed.
The native implementation performed a tick in around half a microsecond. Before optimization, the emulation takes around 50 μs per tick.

It doesn't seem like a lot, but it is still 100x worse. In the server, it translates to an input being handled in >10 ms, while it used to take around 0.3 ms.
I will try to improve the situation. I do not have a precise goal. Let's just say that adding half a frame to the ping in normal circumstances is not acceptable.

One way I will NOT follow: JIT. It is tempting, but it would require too much time before being production-ready.
First big win: direct calls to heavily used functions.

The emulator lets you map memory, using function pointers for read/write.

These functions are called multiple times per instruction. Rule of thumb: if something is called on every byte, the compiler must be able to inline it.
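
A minimal sketch of the idea, with made-up names and a simplified memory map (bus_read, ram and the mapping are mine, not the emulator's real API): replace the per-byte function-pointer dispatch with a direct function the compiler can see and inline.

    #include <array>
    #include <cstdint>

    // Hypothetical names and layout, for illustration only.
    static std::array<uint8_t, 0x800> ram{};      // emulated work RAM

    // Before: one indirect call per byte, opaque to the optimizer.
    using ReadHandler = uint8_t (*)(uint16_t addr);
    extern ReadHandler mapped_read;               // installed at init time

    // After: a direct, inline-able function on the hot path.
    inline uint8_t bus_read(uint16_t addr)
    {
        return ram[addr & 0x07FF];                // mapping simplified to RAM mirrors
    }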

Current perf: 39 μs/tick
Conditional win: do it once.

If both players press a button at the same time, the server simulates twice, trashing the first result and doubling the processing time.

Fix: do not compute the state while there are incoming messages waiting.
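
A rough sketch of the fix; InputMessage, GameState and the function names are placeholders, not the server's real types:

    #include <deque>

    // Placeholder types, not the server's real ones.
    struct InputMessage { int player; unsigned buttons; };
    struct GameState    { unsigned char memory[2048]; };

    void apply_input(GameState&, const InputMessage&);
    void simulate_tick(GameState&);   // the expensive emulated tick

    void process_pending(std::deque<InputMessage>& inbox, GameState& state)
    {
        // Drain every waiting input first...
        while (!inbox.empty()) {
            apply_input(state, inbox.front());
            inbox.pop_front();
        }
        // ...then simulate once. Before the fix, a tick was simulated per
        // message, so two simultaneous inputs meant computing the same
        // state twice and throwing the first result away.
        simulate_tick(state);
    }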

Current perf: 39 μs/tick (but better)
Small win: compute on change.

Small optimization: when the emulated code swaps banks, the pointer to the new bank is computed immediately, instead of being recomputed on each read.
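
Roughly like this, assuming a single switchable 16 KB PRG window (names and layout are mine):

    #include <cstdint>

    // Assumed layout: two 16 KB PRG-ROM banks, one switchable window.
    static uint8_t prg_rom[2][0x4000];
    static const uint8_t* current_bank = prg_rom[0];

    // Called only when the emulated code writes the bank-switch register.
    void on_bank_switch(uint8_t value)
    {
        current_bank = prg_rom[value & 0x01];   // computed once, on change
    }

    // Hot path: no per-read bank lookup anymore.
    inline uint8_t read_banked(uint16_t addr)
    {
        return current_bank[addr & 0x3FFF];
    }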

Current perf: 37 μs/tick
Easy win: another indirect call.

For each instruction, the emulator resolves the addressing mode, then executes the opcode. That is two indirect calls.

By generating all variants of opcode+addressing, we save one indirect call per instruction!
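
Sketched with made-up names: instantiate one fused handler per (addressing mode, opcode) pair, so dispatch costs a single indirect call and the compiler can inline both halves.

    #include <cstdint>

    struct Cpu { uint8_t a, x, y, p, s; uint16_t pc; };

    // Addressing resolvers and opcode bodies (bodies elided).
    inline uint16_t addr_zero_page(Cpu&) { /* ... */ return 0; }
    inline uint16_t addr_absolute(Cpu&)  { /* ... */ return 0; }
    inline void op_lda(Cpu&, uint16_t)   { /* ... */ }
    inline void op_sta(Cpu&, uint16_t)   { /* ... */ }

    // One fused handler per combination.
    template <uint16_t (*Addr)(Cpu&), void (*Op)(Cpu&, uint16_t)>
    void fused(Cpu& cpu) { Op(cpu, Addr(cpu)); }

    using Handler = void (*)(Cpu&);
    static const Handler dispatch[256] = {
        // one entry per opcode byte, e.g. fused<addr_zero_page, op_lda> at 0xA5
    };

    void step(Cpu& cpu, uint8_t opcode)
    {
        dispatch[opcode](cpu);   // one indirect call instead of two
    }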

Current perf: 33 μs/tick
Small win: avoid branching

The read function has to handle memory mapping. It was done with IFs to know where to read (RAM/ROM/registers), causing branch mispredictions.

With a pointer table, reading is now a matter of three lines of code without a branch.
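
Something along these lines, with an assumed layout (real mirrors and hardware registers need more care than a bare pointer):

    #include <cstdint>

    // Assumed layout: one pointer per 256-byte page, filled once at init.
    static uint8_t  ram[0x800];
    static uint8_t  rom[0x8000];
    static uint8_t* pages[256];

    void init_pages()
    {
        for (int page = 0x00; page < 0x20; ++page)   // $0000-$1FFF: RAM + mirrors
            pages[page] = ram + (page % 8) * 0x100;
        for (int page = 0x80; page < 0x100; ++page)  // $8000-$FFFF: ROM
            pages[page] = rom + (page - 0x80) * 0x100;
        // registers and other regions would get their own pages
    }

    // Hot path: no branches, just two lookups.
    inline uint8_t bus_read(uint16_t addr)
    {
        return pages[addr >> 8][addr & 0xFF];
    }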

Current perf: 30 μs/tick
Meh win: compiler

clang > gcc

Current perf: 28 μs/tick
Guru win: (almost) JIT

Recompiling the 6502 code was too tempting! It is cold compilation: JIT-like structures are generated in the source, which allows a pass of optimization by clang.

Making it work was trivial (3 hours of work); making it fast is another story.
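
To give an idea of the shape of it, here is my guess at what such generated source could look like (not the actual generator output, and flag handling is simplified): a short 6502 routine becomes a plain C++ function that clang can optimize as a whole.

    #include <cstdint>

    struct Cpu { uint8_t a, x, y, p, s; uint16_t pc; };

    static uint8_t memory[0x10000];
    inline uint8_t bus_read(uint16_t addr)             { return memory[addr]; }
    inline void    bus_write(uint16_t addr, uint8_t v) { memory[addr] = v; }
    inline void    set_zn(Cpu& cpu, uint8_t v)
    {
        cpu.p = (cpu.p & ~0x82) | (v == 0 ? 0x02 : 0x00) | (v & 0x80);
    }

    // Hypothetical generated code for a routine at $C000 (overflow flag omitted):
    //   C000: LDA $0300
    //   C003: CLC
    //   C004: ADC #$01
    //   C006: STA $0300
    //   C008: RTS
    void routine_C000(Cpu& cpu)
    {
        cpu.a = bus_read(0x0300);                                // LDA $0300
        set_zn(cpu, cpu.a);
        cpu.p &= ~0x01;                                          // CLC
        uint16_t sum = uint16_t(cpu.a) + 0x01 + (cpu.p & 0x01);  // ADC #$01
        cpu.p = uint8_t((cpu.p & ~0x01) | (sum > 0xFF));
        cpu.a = uint8_t(sum);
        set_zn(cpu, cpu.a);
        bus_write(0x0300, cpu.a);                                // STA $0300
        // RTS: return to the dispatcher that called this routine
    }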

Current perf: 19 μs/tick
Final win: compile moar!

The compiler had been skipping some parts of the ROM; now they are compiled too. I guess this is the final emulation optimization. That's 163% speedier than the original implementation.

Current perf: 14 μs/tick

If it was too obscure: 6502 emulation in server ...
How it started // How it's going
Erratum: 163% was before the final optimization. The current code is 257% faster than the naive first implementation (50 μs / 19 μs ≈ 2.63×, and 50 μs / 14 μs ≈ 3.57×).

That's important ... I guess ... For me, at least 🤓
Finish line!

Performance is acceptable. The next step will be to improve the server's logic, which is less epic.

Starting this thread, I did not expect to go this far. I leveled up on emulation, JIT, and optimization.

The emulator is available here: https://github.com/sgadrat/super-tilt-bro-server/tree/master/mos6502

See ya!
You can follow @RogerBidon.