%title: Haswell metaprogramming
%author: @solardiz @Openwall
%date: BSidesLjubljana, March 2017

-> Problem <-

Some compute-heavy problems, or approaches to solving them, require too
much code in their "inner loops" and/or too many processor registers -
for speed

Example: so-called "bitslicing" of cryptographic primitives, where the
code resembles a hardware logic circuit, with each data bit requiring a
register and each logic gate requiring an instruction

Why do this bitslicing weirdness? Also for speed. Our N-bit CPU becomes
N 1-bit CPUs, which provides the sometimes-needed flexibility and lets
us optimize away "operations" such as bit shifts, rotates, and
permutations. (A minimal C sketch appears in Appendix A below.)

Another potential example: JIT-compiled code for matching many requests
against many rules in a packet filter, fast HTTP server, anti-DDoS proxy

On typical CPUs such as Intel Haswell and beyond, L1 instruction cache
is 32 KiB, the uop cache is smaller yet, and there are
16 general-purpose + 16 SIMD registers per thread. Suppose that's not
enough.

---

-> Solution <-

Let the code be in L2 cache, which is 256 KiB

Haswell can fetch and decode 16 bytes of instruction stream per cycle
even for code coming from L2 cache (via L1), with no penalty on code
throughput. (The 16 bytes limit also applies to fetch & decode from L1
on uop cache miss.)

Haswell can execute up to 3 AVX2 instructions per cycle, and no more of
those anyway. If we keep our instructions to <= 5 bytes each, we
utilize the AVX2 execution units fully even with code in L2 cache,
since 3 x 5 = 15 bytes fit within the 16 bytes fetched per cycle.

Haswell is also fast at accessing L1 data cache (same throughput as
registers)

With 16 properly initialized GP registers, we can fit 8-bit immediate
offsets against those into 5-byte instructions and thereby address a
16\*256 bytes "register file". That's 128 extra AVX2-alike 32-byte
"registers". (Appendices B and C below sketch the encodings.)

We got a meta CPU with 256 KiB I-cache and 144 x 256-bit registers.
And it's fast. Magic.

---

-> The catch <-

The workload and its portions must be throughput rather than latency
bound

Only one of the "registers" in an instruction can be virtual (in L1
cache), and it must be either source-only or the destination of a MOV.
Other registers in the same instruction have to be among the 16 regular
AVX2 registers.

There are only 2 read and 1 write ports to L1 data cache, so on average
the 3 instructions processed per cycle must not exceed these ports'
capacity. (Yet individual groups of 3 instructions may exceed it with
no penalty, as long as the workload is only throughput bound, as it
should be.)

In testing, this works great for code sizes of up to roughly one half
of L2 cache size or slightly more (the threshold is typically at
130 KiB to 140 KiB, although it varies a bit between program
invocations). This is likely because the L2 cache is physically
indexed: with larger code sizes there is a high probability of
duplicate indices in excess of the cache's 8-way associativity. (Page
coloring or a custom allocator in the kernel or a kernel module could
help run the full 256 KiB of code fast.)
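
---

-> Appendix A: bitslicing sketch <-

A minimal illustrative sketch (not code from the talk): bit i of 64
independent inputs lives in bit i of a 64-bit word, so one bitwise
instruction acts as the same logic gate across 64 circuits at once.

```c
#include <stdint.h>

typedef uint64_t slice;         /* 64 1-bit "CPUs" per 64-bit word */

/* A 2:1 multiplexer, out = sel ? a : b, written as a logic circuit.
 * Bitsliced, it costs 3 bitwise instructions for 64 instances, and
 * involves no shifts, rotates, or bit permutations. */
static inline slice mux(slice sel, slice a, slice b)
{
    return b ^ (sel & (a ^ b));
}
```

A real bitsliced cipher is many such gate functions chained together,
which is exactly where the code size and register pressure from the
Problem slide come from.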
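
---

-> Appendix B: 5-byte AVX2 instructions <-

A minimal sketch, assuming GCC or Clang with -mavx2 (the window name
and offsets are illustrative): each instruction below touches one
virtual register and encodes in 5 bytes (2-byte VEX prefix + opcode +
ModRM + disp8) as long as the base is one of the low 8 GP registers.

```c
#include <stdint.h>

/* One 256-byte window of the "register file"; 32-byte alignment is
 * required by vmovdqa. */
static uint8_t window[256] __attribute__((aligned(32)));

void demo(void)
{
    /* "a" pins the base to rax, keeping each instruction at 5 bytes:
     *   c5 fd 6f 48 20    vmovdqa 0x20(%rax),%ymm1
     *   c5 f5 ef 40 40    vpxor   0x40(%rax),%ymm1,%ymm0
     *   c5 fd 7f 40 60    vmovdqa %ymm0,0x60(%rax)                */
    __asm__ __volatile__(
        "vmovdqa 0x20(%0), %%ymm1\n\t"         /* MOV from virtual reg */
        "vpxor   0x40(%0), %%ymm1, %%ymm0\n\t" /* virtual source only  */
        "vmovdqa %%ymm0, 0x60(%0)\n\t"         /* MOV to virtual reg   */
        : : "a" (window) : "xmm0", "xmm1", "memory");
}
```

Note how the group also matches the L1 port budget from the catch
slide: two reads and one write.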
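
---

-> Appendix C: emitting the 5-byte encoding <-

Such code is typically generated at runtime. A hypothetical emitter
sketch (function name and layout are illustrative, not from the talk)
that appends the 5-byte encoding of
"vpxor ymm<dst>, ymm<src>, [rax + disp8]" to a code buffer:

```c
#include <stddef.h>
#include <stdint.h>

static size_t emit_vpxor_virtual(uint8_t *p, unsigned dst, unsigned src,
                                 int8_t disp8)
{
    /* 2-byte VEX fields: R and vvvv are stored inverted, L=1 selects
     * 256-bit, pp=01 stands for the 0x66 prefix.  A base of r8..r15
     * would need a 3-byte VEX (6 bytes total), so rax is hardcoded. */
    p[0] = 0xC5;                                      /* 2-byte VEX    */
    p[1] = (uint8_t)((((~dst >> 3) & 1u) << 7) |
                     ((~src & 0xFu) << 3) | 0x05);
    p[2] = 0xEF;                                      /* VPXOR opcode  */
    p[3] = (uint8_t)(0x40 | ((dst & 7u) << 3));       /* mod=01 rm=rax */
    p[4] = (uint8_t)disp8;                            /* window offset */
    return 5;                                         /* bytes written */
}
```

The buffer must of course be made executable (e.g., mmap with
PROT_EXEC) before the generated code is run.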