Something I hadn't really considered with memory-mapped or bulk-read IO is that once you're reading through the buffer linearly, interleaved with other processing, it's going to kick everything else out of the caches because of LRU replacement.
If you do traditional buffered IO where you keep updating the same buffer, that buffer's cache lines are just going to stay in cache and not pressure the rest of your working set.
E.g. consider a symbol table in a lexer/parser. Given the streaming read pattern, in theory the symbol table could use up most of your 256K L2 cache without any contention, so long as you were doing buffered IO with a buffer large enough to amortize the cost of each buffered read.
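
A minimal sketch of that buffered-IO pattern, assuming plain C stdio; process_chunk is a hypothetical stand-in for the lexer feed (here it just counts newlines so the example runs):

```c
#include <stdio.h>

enum { BUF_SIZE = 64 * 1024 };  /* one fixed buffer, reused every iteration */

/* Hypothetical stand-in for the lexer consuming a chunk of input. */
static size_t process_chunk(const char *data, size_t len) {
    size_t lines = 0;
    for (size_t i = 0; i < len; i++)
        if (data[i] == '\n')
            lines++;
    return lines;
}

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;
    /* The same 64K of cache lines gets hit on every pass, so the buffer
       stays resident instead of marching through fresh lines and evicting
       the rest of the working set (e.g. the symbol table). */
    static char buf[BUF_SIZE];
    size_t n, total = 0;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        total += process_chunk(buf, n);
    fclose(f);
    printf("%zu lines\n", total);
    return 0;
}
```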
But if you do the same thing with memory-mapped IO or bulk-read IO (where you read the whole file into a buffer up front and then process it), you're going to steamroll through the symbol table's cache lines because of how much data you're streaming through.
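
For contrast, a sketch of the mmap version of the same pass, assuming POSIX mmap; this is the access pattern that touches every cache line exactly once and never again, which is exactly what LRU punishes:

```c
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) < 0) return 1;
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) return 1;
    /* One linear pass over the whole file: each line is used once, so LRU
       keeps evicting your actual working set to cache bytes you'll never
       look at again. */
    size_t lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;
    printf("%zu lines\n", lines);
    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```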
In theory this is the perfect use case for non-temporal read instructions, but I've only heard horror stories about them in practice, so I don't know.
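
A hedged sketch of what a non-temporal read hint might look like on x86, assuming SSE intrinsics: prefetchnta (_mm_prefetch with _MM_HINT_NTA) is the usual hint for reads from normal write-back memory, since MOVNTDQA/_mm_stream_load_si128 only behaves non-temporally on write-combining memory, which is one source of those horror stories. Whether this helps at all is microarchitecture-dependent, so measure before trusting it:

```c
#include <stdio.h>
#include <string.h>
#include <immintrin.h>

/* prefetchnta hints that a line is non-temporal so it (roughly) bypasses
   the outer cache levels. Prefetches never fault, so issuing one a bit
   past the end of the buffer is harmless. */
static size_t count_newlines_nta(const char *data, size_t len) {
    size_t lines = 0;
    for (size_t i = 0; i < len; i++) {
        if ((i & 63) == 0)  /* once per 64-byte cache line */
            _mm_prefetch(data + i + 256, _MM_HINT_NTA);  /* a few lines ahead */
        if (data[i] == '\n')
            lines++;
    }
    return lines;
}

int main(void) {
    const char *s = "one\ntwo\nthree\n";
    printf("%zu lines\n", count_newlines_nta(s, strlen(s)));
    return 0;
}
```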