I just ran into an interesting performance issue. We're implementing a feature that collects runtime metrics from apps in the background (using EventListener/EventCounters, collection is done in-process). I load-tested it and found a 4.5% overhead, far above what I expected [...]
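For context, the in-process collection follows the usual EventListener pattern: subscribe to an event source that publishes counters and ask for periodic samples. Here's a minimal sketch, assuming the built-in System.Runtime counters and a 1-second interval (our actual code is more involved):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Minimal sketch: collect the built-in System.Runtime event counters in-process.
internal sealed class RuntimeCountersListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask the runtime to publish counter values every second.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload == null)
            return;

        foreach (var payload in eventData.Payload)
        {
            if (payload is IDictionary<string, object> counter)
            {
                // Counters expose "Mean" or "Increment" depending on their type.
                var value = counter.TryGetValue("Mean", out var mean) ? mean : counter["Increment"];

                // In the real library this is forwarded to the metrics pipeline.
                Console.WriteLine($"{counter["Name"]} = {value}");
            }
        }
    }
}
```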
The setup is two Amazon EC2 instances on Linux: one c5.xlarge (4 cores) generating the traffic with wrk, and one c5.large (2 cores) hosting an ASP.NET Core app instrumented with our library. Throughput was ~13.8K requests/second with the new metrics, ~14.5K without.
What struck me immediately was the CPU usage, which averaged 85% when metrics were enabled. At the beginning of each run there could be a 1-2 minute window where CPU was at 100%, but it would eventually drop. With metrics disabled, CPU stayed at 100% the whole time.
I spent some time looking for a wall-clock profiler on Linux. In the process I discovered perf timechart, which unfortunately produces a 500 MB SVG file for 5 seconds of profiling, making it completely unusable. That's a shame, because the issue is clearly visible in the preview.
Eventually I remembered that good old dotTrace now works on Linux. It allowed me to discover that our statsd client uses a threadpool worker to consume a background queue. The thread spends most of its time waiting synchronously for new items. Not ideal but shouldn't be harmful.
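I don't have the client's code at hand, but the pattern is roughly the following. A sketch with made-up names, not the actual implementation:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch of the pattern: a producer/consumer queue whose consumer runs on a
// threadpool thread and blocks while waiting for new items.
public class StatsSender
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

    public StatsSender()
    {
        // Task.Run schedules the consumer on a threadpool worker...
        Task.Run(ProcessQueue);
    }

    public void Enqueue(string metric) => _queue.Add(metric);

    private void ProcessQueue()
    {
        // ...which then spends most of its life blocked here, waiting for items,
        // effectively pinning one of the few threadpool threads.
        foreach (var metric in _queue.GetConsumingEnumerable())
        {
            Send(metric);
        }
    }

    private void Send(string metric) { /* write to the statsd socket */ }
}
```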
Then I captured another trace with metrics disabled, and something struck me: both traces had the same number of managed threads, even though the trace with metrics enabled had that one worker spending 99.9% of its time waiting.
(metrics disabled on the left, enabled on the right)
For some reason it looks like the threadpool won't grow beyond 3 threads for this app. When one of those threads was stuck waiting, the remaining two weren't enough to saturate the CPU. To confirm, I changed the consumer to use a dedicated thread, and CPU usage went back to 100%.
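The change amounts to replacing the Task.Run call from the sketch above with a dedicated thread; something along these lines:

```csharp
public StatsSender()
{
    // Run the consumer on its own dedicated thread instead of borrowing a
    // threadpool worker, so blocking on the queue no longer starves the pool.
    var consumer = new Thread(ProcessQueue)
    {
        IsBackground = true,        // don't keep the process alive just for this thread
        Name = "StatsD consumer"    // illustrative name; makes the thread easy to spot in traces
    };
    consumer.Start();
}
```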
Going back to the earlier runs where CPU started at 100% and then fell to 85%, it looks like the threadpool started with a large number of workers, then progressively scaled down to 3. Past that point, it oscillated between 2 and 3, but never went up to 4.
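A cheap way to watch that scaling without a profiler is to sample ThreadPool.ThreadCount (available since .NET Core 3.0). A throwaway sketch:

```csharp
using System;
using System.Threading;

// Throwaway diagnostic: print the number of threadpool threads every second
// to watch the pool scale down over the course of a load test.
var timer = new Timer(_ =>
{
    Console.WriteLine($"{DateTime.UtcNow:HH:mm:ss} ThreadPool.ThreadCount = {ThreadPool.ThreadCount}");
}, null, TimeSpan.Zero, TimeSpan.FromSeconds(1));

Console.ReadLine(); // keep the process alive while sampling
GC.KeepAlive(timer);
```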
That's a bit of an extreme case, but I'm surprised the threadpool failed to conclude that the app performed better with 4 threads than with 3. It's possible to enable events that expose the throughput measured by the threadpool, so I'll try playing with them: https://docs.microsoft.com/en-us/dotnet/framework/performance/thread-pool-etw-events#threadpoolworkerthreadadjustment
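On Linux those events can be consumed in-process with an EventListener on the runtime provider. A sketch of what I have in mind, where the 0x10000 Threading keyword value comes from the docs and the exact payload fields may vary by event:

```csharp
using System;
using System.Diagnostics.Tracing;
using System.Linq;

// Sketch: listen in-process to the runtime's ThreadPoolWorkerThreadAdjustment* events,
// which carry the throughput measurements used by the hill-climbing algorithm.
internal sealed class ThreadPoolAdjustmentListener : EventListener
{
    private const EventKeywords ThreadingKeyword = (EventKeywords)0x10000; // per the ETW docs

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Microsoft-Windows-DotNETRuntime")
        {
            EnableEvents(source, EventLevel.Informational, ThreadingKeyword);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName?.StartsWith("ThreadPoolWorkerThreadAdjustment") == true)
        {
            // Dump the raw payload; fields include things like Throughput and NewWorkerThreadCount.
            var payload = string.Join(", ",
                eventData.PayloadNames.Zip(eventData.Payload, (name, value) => $"{name}={value}"));
            Console.WriteLine($"{eventData.EventName}: {payload}");
        }
    }
}
```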