Refactoring the server's guts. It used to reimplement the game in C++, causing desyncs on any tiny divergence from the client's ASM. Now it embeds a 6502 emulator to run the game's routines. Two days of work, and it already passes my trickiest tests.
And... programming multi-xeon servers with 6502

I am running into performance issues ... What a surprise!
Using the emulator has gone a long way toward crushing bugs. I'll try to optimize it, I really don't want to go back to the "reimplement the game" solution.
Ok let's make a thread for this performance quest!
The starting point is the mos6502.cpp emulator. It is simple enough for a quick-start, and easily hackable. It is however not performance-oriented.
Two pain points with the emulated tick, compared to re-implementing it in C++:
Game state size grew from ~150 bytes to 2 KB. We copy it a lot.
The emulation itself, instead of running native code.
Profiling shows that copying is not the problem. Focus on emulation speed.
Native implementation performed a tick in around half a microsecond. Before optimization, emulation takes around 50 μs per tick.
It doesn't seem like a lot, but it is still 100x worse. In the server, it translates to an input being handled in >10 ms, while it used to be around 0.3 ms.
I will try to improve the situation. I do not have a precise goal. Let's just say that adding half a frame to the ping in normal circumstances is not acceptable.
One path I will NOT follow: JIT. It is tempting, but it would take too much time to become production ready.
First big win: direct call to overused functions.
Emulator lets you map memory, using function pointers for read/write.
These are called multiple times per instruction. Rule of thumb: if it is called on every byte, the compiler must be able to inline it.
Current perf: 39 μs/tick
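A minimal sketch of the idea, not the emulator's actual code: `FlatBus`, `PointerCpu`, `DirectCpu` and `flat_read` are hypothetical names. The first CPU goes through a function pointer on every access; the second takes the bus as a template parameter, so the compiler knows the call target and is able to inline it.

```cpp
#include <array>
#include <cstdint>

// Hypothetical flat 64 KB memory, standing in for the real memory mapping.
struct FlatBus {
    std::array<uint8_t, 0x10000> mem{};
    uint8_t read(uint16_t addr) const { return mem[addr]; }
    void write(uint16_t addr, uint8_t value) { mem[addr] = value; }
};

// Before: the CPU only knows a function pointer, so every access is an
// indirect call that the compiler generally cannot inline.
struct PointerCpu {
    uint8_t (*read)(void* ctx, uint16_t addr);
    void* ctx;
    uint8_t fetch(uint16_t pc) const { return read(ctx, pc); }
};

// After: the bus type is a template parameter, so the call target is known
// at compile time and read() can be inlined into fetch().
template <typename Bus>
struct DirectCpu {
    Bus& bus;
    uint8_t fetch(uint16_t pc) const { return bus.read(pc); }
};

// Adapter needed only by the function-pointer variant.
inline uint8_t flat_read(void* ctx, uint16_t addr) {
    return static_cast<FlatBus*>(ctx)->read(addr);
}
```

Both variants return the same bytes; only the way the call is resolved differs, which is where the time goes when it happens several times per instruction.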
Conditional win: do it once.
If both players press a button at the same time, the server simulates twice, throwing away the first result and doubling the processing time.
Fix: do not compute the state while there are incoming messages waiting.
Current perf: 39 μs/tick (but better)
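Roughly, the fix looks like the sketch below. Everything here is an assumption made for illustration (`InputMessage`, `GameServer`, the `inbox` queue); the real server's types are not shown in this thread. The point is only the ordering: drain every waiting message, then simulate once.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical input message: which player pressed what, on which frame.
struct InputMessage {
    uint8_t player;
    uint8_t buttons;
    uint32_t frame;
};

// Hypothetical server-side state; `simulations` counts the expensive part.
struct GameServer {
    std::queue<InputMessage> inbox;
    uint8_t buttons[2] = {0, 0};
    int simulations = 0;

    void apply(const InputMessage& msg) { buttons[msg.player & 1] = msg.buttons; }
    void simulate() { ++simulations; }  // stand-in for the emulated ticks

    // Drain every waiting message first, then compute the state only once.
    void process() {
        if (inbox.empty()) return;
        while (!inbox.empty()) {
            apply(inbox.front());
            inbox.pop();
        }
        simulate();
    }
};
```

With two inputs queued for the same frame, `process()` runs a single simulation instead of two.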
Small win: compute on change.
Small optimization: when the emulated code swaps banks, the pointer to the new bank is computed immediately, instead of being recomputed on each read.
Current perf: 37 μs/tick
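A sketch of that change, assuming a hypothetical mapper with 16 KB switchable banks (`BankedRom`, `select_bank` and the bank size are made up for the example, not taken from the server):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical mapper with 16 KB switchable banks.
// Assumes the ROM holds at least one full bank.
class BankedRom {
public:
    static constexpr std::size_t kBankSize = 0x4000;

    explicit BankedRom(std::vector<uint8_t> rom)
        : rom_(std::move(rom)), bank_(rom_.data()) {}

    // Called when the emulated code swaps banks: the pointer is computed
    // once, here...
    void select_bank(std::size_t index) {
        bank_ = rom_.data() + (index % bank_count()) * kBankSize;
    }

    // ...so the hot read path is a single indexed load.
    uint8_t read(uint16_t addr) const { return bank_[addr & (kBankSize - 1)]; }

    std::size_t bank_count() const { return rom_.size() / kBankSize; }

private:
    std::vector<uint8_t> rom_;
    const uint8_t* bank_;  // cached pointer to the currently selected bank
};
```

Bank swaps are rare compared to reads, so moving the multiplication out of `read()` is cheap to do and pays on every access.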
Easy win: another indirect call.
For each instruction, the emulator resolves the addressing, then executes the opcode. That is two indirect calls.
Generate all variants of opcode+addressing, and we save one indirect call per instruction!
Current perf: 33 μs/tick
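One way to generate the variants is with templates, as in this sketch. It is an illustration under assumptions, not the emulator's code: `Cpu`, `execute`, `make_table` and `step` are hypothetical, only LDA is implemented, and status flags are left out to keep it short.

```cpp
#include <array>
#include <cstdint>

struct Cpu {
    uint8_t a = 0, x = 0;
    uint16_t pc = 0;
    std::array<uint8_t, 0x10000> mem{};
};

// Addressing modes: each returns the effective address and advances pc.
struct Immediate {
    static uint16_t resolve(Cpu& c) { return c.pc++; }
};
struct ZeroPage {
    static uint16_t resolve(Cpu& c) { return c.mem[c.pc++]; }
};
struct ZeroPageX {
    static uint16_t resolve(Cpu& c) { return uint8_t(c.mem[c.pc++] + c.x); }
};

// Operation, written once against an effective address (flags omitted).
struct Lda {
    static void run(Cpu& c, uint16_t addr) { c.a = c.mem[addr]; }
};

// One function per (operation, addressing) pair; both halves inline into it.
template <typename Op, typename Mode>
void execute(Cpu& c) { Op::run(c, Mode::resolve(c)); }

using Handler = void (*)(Cpu&);

// Table indexed by opcode byte: a single indirect call per instruction.
inline std::array<Handler, 256> make_table() {
    std::array<Handler, 256> table{};
    table[0xA9] = execute<Lda, Immediate>;
    table[0xA5] = execute<Lda, ZeroPage>;
    table[0xB5] = execute<Lda, ZeroPageX>;
    return table;
}

inline void step(Cpu& c, const std::array<Handler, 256>& table) {
    uint8_t opcode = c.mem[c.pc++];
    if (Handler h = table[opcode]) h(c);  // unimplemented opcodes are skipped
}
```

The table still costs one indirect call per instruction, but the addressing resolution is now inlined into each handler.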
Small win: avoid branching
The read function has to handle memory mapping. It was done with IFs to know where to read (RAM/ROM/registers), causing branch mispredictions.
With a pointer table, reading is now a matter of three lines of code without any branch.
Current perf: 30 μs/tick
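A possible shape for such a table, as a sketch. The layout is an assumption (256-byte pages, NES-style 2 KB RAM mirrored up to $1FFF, 32 KB ROM at $8000), and `MemoryMap` is a hypothetical name. Registers are simplified here: their pages just point at a plain buffer, while a real emulator has to handle their side effects somewhere.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: the 64 KB address space is split into 256-byte pages,
// and each page points at the host memory backing it.
struct MemoryMap {
    std::array<uint8_t, 0x0800> ram{};       // 2 KB, mirrored up to $1FFF
    std::array<uint8_t, 0x8000> rom{};       // 32 KB at $8000-$FFFF
    std::array<uint8_t, 0x0100> other{};     // stand-in for everything else
    std::array<const uint8_t*, 256> pages{};

    MemoryMap() {
        for (std::size_t p = 0; p < 256; ++p) {
            if (p < 0x20)       pages[p] = ram.data() + (p & 0x07) * 0x100;
            else if (p >= 0x80) pages[p] = rom.data() + (p - 0x80) * 0x100;
            else                pages[p] = other.data();
        }
    }

    // The table points into this object, so copying would leave dangling pages.
    MemoryMap(const MemoryMap&) = delete;
    MemoryMap& operator=(const MemoryMap&) = delete;

    // The branches live in the constructor; the hot path is a table lookup
    // followed by a load.
    uint8_t read(uint16_t addr) const {
        const uint8_t* page = pages[addr >> 8];
        return page[addr & 0xFF];
    }
};
```

The IFs did not disappear, they moved to setup time, where they run 256 times instead of on every read.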
Meh win: compiler
clang > gcc
Current perf: 28 μs/tick
Guru win: (almost) JIT
Recompiling the 6502 code was too tempting! It is cold compilation: JIT-like structures are generated as source code, ahead of time. This gives clang a chance to run an optimization pass over them.
Making it work is trivial (3h work), making it fast is another story.
Current perf: 19 μs/tick
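To give an idea of what "generated in the source" can look like, here is a hand-written sketch of the kind of C++ a generator could emit for a tiny routine. It is not output from the actual tool: `Cpu6502`, `set_nz`, `routine_8000` and the routine itself are made up for the example.

```cpp
#include <cstdint>

struct Cpu6502 {
    uint8_t a = 0, x = 0, y = 0;
    bool zero = false, negative = false;
};

// Update the Z and N flags from a result byte, like the 6502 does.
inline void set_nz(Cpu6502& c, uint8_t v) {
    c.zero = (v == 0);
    c.negative = (v & 0x80) != 0;
}

// What a generator could emit for this hypothetical routine:
//   $8000  LDX #$04
//   $8002  DEX
//   $8003  BNE $8002
//   $8005  RTS
// Each 6502 instruction becomes plain C++, so the host compiler can optimize
// across instruction boundaries instead of dispatching opcodes one by one.
inline void routine_8000(Cpu6502& c) {
    c.x = 0x04; set_nz(c, c.x);               // LDX #$04
L8002:
    c.x = uint8_t(c.x - 1); set_nz(c, c.x);   // DEX
    if (!c.zero) goto L8002;                  // BNE $8002
    return;                                   // RTS
}
```

There is no fetch or decode left at run time: the branch is a `goto`, and the flag updates that nobody reads can be removed by the optimizer.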
Final win: compile moar!
The recompiler was skipping some parts of the ROM, now it compiles them too. I guess it is the final emulation optimization. That's 163% speedier than the original implementation.
Current perf: 14 μs/tick
If it was too obscure: 6502 emulation in server ...
How it started // How it's going
Erratum: 163% was the figure before the final optimization (50 μs → 19 μs). The current code is 257% faster than the naive first implementation (50 μs → 14 μs).
That's important ... I guess ... For me, at least

Finish line!
Performance is acceptable. The next step will be to improve the server's logic, which is less epic.
Starting this thread, I did not expect to go so far. I leveled up on emulation, JIT, and optimizations.
The emulator is available here: https://github.com/sgadrat/super-tilt-bro-server/tree/master/mos6502
See ya!