The time the x86 emulator team found code so bad they fixed it during emulation

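Summary

While developing an x86 emulator that used binary translation, the team came across a program whose compiler, instead of emitting an ordinary loop to initialize 64KB of stack memory, had unrolled the loop into 65,536 separate write instructions, so the initialization code alone took up 256KB. The team found this so wasteful that they added special logic to the translator to recognize the function and replace it with an equivalent tight loop.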

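Why read it

An x86-32 binary translator, in effect a JIT compiler, recognized a function in which a compiler had unrolled the initialization of a 64KB stack buffer into 65,536 write-byte instructions, 256KB of code in all, and substituted an equivalent tight loop.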

Original text

During an exchange of war stories, a colleague of mine told one from back in the days when Windows included a processor emulator for x86-32 on systems that natively ran some other processor. (This has happened many times. And no, I don’t know which processor this particular story applied to.)

This particular emulator employed binary translation, generating native code to perform the equivalent operations of the original x86-32 code. This offered a significant performance improvement over emulation via interpreter. You can imagine that x86-32 is just a bytecode, and the emulator is a JIT compiler.
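To make the interpreter-versus-translator contrast concrete, here is a toy sketch. Everything in it is an assumption for illustration: the two-instruction guest language, the names `interpret` and `translate`, and the use of Python source as the "native code". It is not how the Windows emulator worked, only the general shape of the idea.

```python
# Hypothetical toy guest instruction set: ("add", n) and ("mul", n),
# both acting on a single accumulator. Not real x86-32.
GUEST_PROGRAM = [("add", 3), ("mul", 4), ("add", 1)]

def interpret(program, acc):
    """Interpreter: decode and dispatch every instruction on every run."""
    for op, arg in program:
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        else:
            raise ValueError(f"unknown op {op!r}")
    return acc

def translate(program):
    """Translator: decode once, emit host code (Python source here), compile it."""
    lines = ["def translated(acc):"]
    for op, arg in program:
        if op == "add":
            lines.append(f"    acc += {arg}")
        elif op == "mul":
            lines.append(f"    acc *= {arg}")
        else:
            raise ValueError(f"unknown op {op!r}")
    lines.append("    return acc")
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the generated host code
    return namespace["translated"]

translated = translate(GUEST_PROGRAM)
print(interpret(GUEST_PROGRAM, 2), translated(2))
```

Both paths compute the same result, but the translated function pays the decoding cost once instead of on every run, which is where the performance improvement comes from.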

Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, then subtract 65536 from the stack pointer, and then initialize the memory in a small, tight loop.

But using a loop to initialize the memory was too mundane for whatever compiler was used to compile this code. Instead of generating a loop to initialize each byte of the buffer, the compiler “optimized” the code by unrolling the loop into 65,536 individual “write byte to memory” instructions, each 4 bytes long.

All in all, it took this program 256 kilobytes of code to initialize 64 kilobytes of data.
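The arithmetic behind that figure is short enough to spell out. The sketch below uses only the two numbers given in the story (a 64KB buffer and 4 bytes per store instruction); the function and constant names are made up for illustration.

```python
BUFFER_BYTES = 64 * 1024   # 64KB buffer from the story
BYTES_PER_STORE = 4        # each "write byte to memory" instruction was 4 bytes long

def unrolled_code_size(buffer_bytes, bytes_per_store):
    """Code bytes needed when every buffer byte gets its own store instruction."""
    return buffer_bytes * bytes_per_store

code_bytes = unrolled_code_size(BUFFER_BYTES, BYTES_PER_STORE)
print(code_bytes // 1024, "KB of code")
print(code_bytes // BUFFER_BYTES, "bytes of code per byte of data")
```

That is 65,536 × 4 = 262,144 bytes, or four bytes of code for every byte of data initialized.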

This offended the team so much that they added special code to the translator to detect this horrible function and replace it with the equivalent tight loop.
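The story does not say how the translator recognized the function, so the following is only a guess at the general idea, under loud assumptions: stores are modeled as hypothetical `(offset, value)` pairs, the offsets ascend one byte at a time, every store writes the same value, and the names `detect_fill_run` and `run_as_loop` are invented here.

```python
from typing import List, Optional, Tuple

# Hypothetical model of one unrolled instruction:
# (offset from the buffer base, byte value written).
Store = Tuple[int, int]

def detect_fill_run(stores: List[Store], min_run: int = 16) -> Optional[Tuple[int, int, int]]:
    """Return (start_offset, length, value) if `stores` is one long run of byte
    writes to consecutive ascending offsets with a single value, else None."""
    if len(stores) < min_run:
        return None
    start, value = stores[0]
    for i, (offset, byte) in enumerate(stores):
        if offset != start + i or byte != value:
            return None
    return start, len(stores), value

def run_as_loop(buffer: bytearray, start: int, length: int, value: int) -> None:
    """The tight-loop equivalent of the unrolled stores."""
    for i in range(length):
        buffer[start + i] = value

# Assumed shape of the offending function: 65,536 stores of zero.
unrolled = [(offset, 0) for offset in range(65536)]
match = detect_fill_run(unrolled)

buf = bytearray(b"\xff" * 65536)
if match is not None:
    run_as_loop(buf, *match)
```

A real translator would match on decoded machine instructions rather than tuples, and would have to confirm that nothing else in the function observes the intermediate state, but the substitution itself is this simple: one recognized pattern in, one loop out.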
