精读预计 24 分钟

A line-by-line translation of the OCaml runtime from C to Rust

摘要

作者将 OCaml runtime 的 71 个 .c 文件逐行翻译为等效 .rs 文件，每步都提交已知状态以便回退。项目在 rust-runtime 分支（Rust stable），rust-runtime-nightly 分支用 nightly 特性（显式尾调用等）进一步优化。测试包括 OCaml 编译器测试套件、sandmark 基准、TSan、构建 opam 包等。作者预期 native 慢 10-20%、bytecode 慢 20-30%，实际结果 native 约持平（rust stable 0.87-1.13x），bytecode 稳定版 1.44x、nightly 可达 0.91x 或更快。还分析了 unsafe 必要性（类型擦除、GC 借用检查、C ABI 兼容），并探讨了作为教育工具的 LLM 翻译过程。

荐读理由

根据原文，照 https://github.com/mbacarella/rustcaml 项目代码就能把 OCaml runtime 从 C 完整一行一行翻译成 Rust 稳定版本，克隆后切换 rust-runtime 分支运行正常构建指令就能用上，能在 AI 工程项目里迁移来代替部分 runtime 逻辑

原文

A line-by-line translation of the OCaml runtime from C to Rust

The Rust community owes the OCaml community a debt of gratitude, the first Rust compiler was written in OCaml. In this post we rewrite the OCaml runtime in Rust. Consider it our down payment on repaying the debt.

No Marks Shinwell were harmed[1]

More seriously

The OCaml runtime is written in C. I present a port of the C to Rust. It works. It passes the OCaml compiler’s test suite. The test suite is upstream’s, unmodified. The OCaml compiler with the Rust runtime can build itself, can build dune, install an opam switch, and build arbitrary OCaml programs in both bytecode and native. There’s no C left in the runtime. The performance is… surprising! The most fun part was what it taught me about OCaml.

The fork is here https://github.com/mbacarella/rustcaml, if you want to check it out. You just clone the repo, be on the rust-runtime branch and run the normal build instructions. Make sure you have Rust cargo installed. There’s also a rust-runtime-nightly branch that runs the interpreter faster. More on these below.

Consider the rest of this post more as an experience report and proto case-study of a human steered AI driven rewrite of a mature, sophisticated codebase: the OCaml runtime.

The way this was done was not by saying “yo Claude, spin up an agent swarm or whatever, you’re going to greenfield design and implement an OCaml runtime from scratch in Rust. Make no mistakes”. Rather, the instructions were more like “lets file-by-file, line-by-line translate the C code to equivalent Rust code and check with me every step of the way”.

Despite fairly low-tech orchestration, normal person token budget and Claude’s protests, it worked!

Wait, isn’t OCaml written in OCaml?

Quick detour for readers wondering why OCaml isn’t written in OCaml. Most of it is! The compiler, the standard library and the tooling are. But the runtime is written in C. The source tree ships with boot/ocamlc, a bytecode build of the compiler from a previous version. To build OCaml from source, you first build a C program called ocamlrun that knows how to execute OCaml bytecode. Then you use ocamlrun + boot/ocamlc to compile the current sources into native ocamlc. Why not write the runtime in OCaml too? You could, but one reason not to is then you’d have to ship prebuilt binaries for every architecture and OS you plan to support, forever. There are other good reasons too, which will be explored below.

A word on methodology

I used Claude Code for this with the Opus 4.7[2] model on an x86-64 Linux box. Claude did most of the tactical work: coding and presenting a menu of choices on where to go next, and running the tests. I did the steering and some pep talking (more discussed below).

The Bun project recently reported porting 960k(!) lines of Zig to Rust, which generated a fair bit of, let’s say… excitement. I’ll confess my eyes glaze over the moment I see “JavaScript” so I come to this kind of late. But the comparison is instructive because of the differences: Bun runs on JavaScriptCore, so the garbage collector and the interpreter are C++ and were never in scope for the Rust port. Gnarly as that rewrite was, it’s kind of machinery wrapped around a pre-existing engine. It side-steps porting the really interesting part, IMO.

So, that makes this project different. The OCaml runtime is an expertly engine written C: sophisticated multicore machinery, a real garbage collector, FFI contracts that have to be honored to the byte. None of this stuff is part of the normal programmer experience. How would an LLM do at porting this? I wanted to know!

The plan was simple but I think worth spelling out. Make sure the compiler is never not able to run. Put a per-file toggle in the build: flip it and the linker selects the Rust version of a file instead of the C version. This let us port one file at a time, run the full test suite after each flip, and commit a known-good state before moving on.

Each .c file was effectively line-by-line translated to a .rs file as closely as could be done, as that would make it easiest to track and incorporate upstream changes in OCaml. The port can grow with older brother OCaml.

The purpose of the git commit after each port was so that if we get really stuck in the near or far future, like we discover an application that trips an extremely obscure bug, we’ll be able to bisect to any point in time to figure out where in the porting effort the divergence was introduced. The OCaml runtime was never not in a state where it couldn’t run, whether it was 2%, 50% or 99% ported to Rust.

We ended up with two branches. A rust-runtime branch which was kept to Rust stable, and then rust-runtime-nightly, which added Rust nightly features to experiment with some performance.

Benchmarks

Okay, I made you wait long enough. You’re all curious. My prediction going into this was that the Rust-based native executables would be about 10-20% slower and the byte code interpreter would run 20-30% slower.

This was my expectation given how I had constrained the port, no use of idiomatic Rust, match the C line-by-line as closely as possible, stick to Rust stable features, and don’t use the computed gotos.

The headline results are “eh, just about parity, kinda”

Runtime	Bytecode (cycles vs C)	Native (cycles vs C)
C (trunk)	1.00x: (baseline)	1.00x (baseline)
Rust (stable)	1.44x, slower than C	~1.05x (range: 0.87-1.13)
Rust (nightly) (ETCs)	0.91x, faster than C	same as Rust stable

According to a selection of sandmark tests, the native executables are at about parity. The byte code interpreter, on the other hand, runs about 2x slower in Rust stable because of missing computed gotos. If we switch to Rust nightly and add explicit tail calls (ETCs), either equals or even slightly beats the C runtime.

I was expecting the non-idiomatic translations to be slower across the board since you’re fighting the language instead of letting it shine. But it turns out C-like-Rust, albeit with some warts, can perform about as well as C.

Full table of benchmarks are at the end.

2015 `unsafe`s

ocaml/runtime % rg unsafe *.rs | wc -l
2015

Let’s also get this out of the way: this is not safer for being in Rust. This cut is probably slightly unsafer than the C, if anything. But the ~2015 unsafes aren’t a translation failure. Rust simply makes the unsafe that was already there more legible.

Three reasons the number can’t be much lower, most irreducible first:

Type erasure. After compilation, every value is one machine word tagged int or heap pointer, and the runtime spends its whole life reading and writing untyped words. There are no static types at that level for Rust to check; every field access is ~a raw pointer deref. The borrow checker can’t work with that.
The runtime is the GC. It mutates and moves live pointers during collection. This is precisely the set of things the borrow checker exists to forbid. Any Rust port of language runtimes will struggle with this, human or machine written!
It must speak the C ABI because that’s the OCaml runtime ABI. Further, Rust has no stable FFI of its own, so extern “C” is the only way to interoperate between languages. This is just unavoidably unsafe.

A fourth reason is on me: I boxed (ha!) Claude into translating the C line-by-line instead of idiomatically, to track upstream, so some of these preserve the C correspondence and could be encapsulated later if the project ever became unconstrained.

A lot of unsafe can’t be removed without breaking OCaml as it is known. In this sense, the introduction of unsafe is simply revealing how much of what goes on inside of the runtime is carefully managed unsafety.

That all said, this could be a good base to increase safety overall if the project diverges from build and binary compatibility with OCaml.

More musings

When I first asked Claude to take inventory and propose porting this project it thought it was fairly crazy and unlikely to succeed. At some point it predicted subtle errors would creep in and become quasi-irreducible and it would take 2-3 years to finish, if we didn’t get bored of it by then.

I had to kind of give it a pep talk and go over ways we could tread lightly and keep our confidence high. It reminded me of working with human developers, in a way.

Not that it wasn’t correctly calibrated! The OCaml runtime does push C to its limits! It uses a lot of very clever macros to make the developer experience manageable. The multicore system makes strong guarantees. The foreign function interface (FFI) depends on C ABI compatibility, and that must be maintained forever as Rust has no standard ABI of its own. Once in awhile it would port a file, fail to make test progress, get discouraged and revert it and suggest we try something else for now.

Here in meat space, the whole project took about 7 days. Most of it was spent waiting on me to approve commands while I went about my normal life (you can direct Claude Code from the mobile app but it’s still pretty flakey and hangs a lot). Unlike the experiment with the Claude C Compiler, I was not using “Claude Teams”, or any other kind of multi-agent orchestration. I only had one agent going. A lot of time was also spent waiting for test suites to re-run every time we ported one of the 71 runtime/*.c files to .rs. Iterating and isolating and fixing failures that did crop up took many hours at a time.

Regarding workflow, I did very little gatekeeping of the code. In my experience, line-by-line translations are something LLMs excel at. Also I wouldn’t have been able to review ~40,000 lines of code. Rather, I spent more time at the file unit level and decided which files to port in what order and ensuring these were being exercised in tests. Mostly I kept imagining different ways of how horrible it would be if we passed every test we could think of and then ran some obscure application and got a weird heap corruption bug and had no idea where to start. Keeping that in mind reined the LLM in a bit.

Claude’s carefulness discipline was good but not great

The compiler testsuite and sandmark were our guiding light and Claude never once attempted to hack them to make progress. Claude did get confused once in awhile though. Early on, a test triggered a segfault and Claude kept going, continuing to port more files over. When I pushed back and asked why it didn’t investigate the segfault, it said it assumed that the segfault was always there, even in the trunk version, and never thought to check.

This happened a few times, actually. In long running sessions, especially if context has been compacted, it can lose track of what the gold condition of the tests is and convince itself that the current test failures were not introduced by its changes. I would have to remind it to run the testsuite on trunk and see that they all come back green to really get the point home.

I know I said LLMs excel at line-by-line translations, and they do. They’re better than humans are at it, though they’re not flawless. Aside from bulldozing past the segfault that one time, it also made mistakes like not reproducing big enums perfectly, like if they started at a different index from 0. Once in awhile it gets magic numbers backwards in initializers (e.g. { a=2, b=1 } instead of { a=1, b=2 }. If there’s a lot of stuff to manually transcribe it was smart enough to write some sed/awk scripts to translate them, but not always. Generally these mistakes showed up quickly in tests, but took a lot of time for it to iterate on to spot and fix.

Claude frequently doubted itself but with simple prodding it would do the work and succeed spectacularly. Other times it grouped things up not realizing they could still be independent. E.g. it decided we needed to port the entire GC at once, which spanned several large files, even though we’ve been following a one-by-one porting strategy the entire time, and was getting ready to try all four at once and really knuckle down trying to debug 5000 lines of very intensive Rust. But when reminded of the fact that we’d been following a by-file strategy the whole time and asking why it couldn’t work here, it admitted it could, and then did it.

Overall, Claude had fairly good instincts. It felt like pair programming with someone who’d internalized the OCaml runtime, C and Rust all at once. But also like pair programming with someone that had ADHD.

Tests were done

Tests run

OCaml compiler’s test suite
The fixpoint test: having the compiler build itself and then using that to build itself again and confirming the two generations of compiler code are identical
Sandmark benches, both A/B test and performance
TSan tests
Build an opam switch with the compiler (basically, tell it the system compiler is at ~/code/ocaml)
Build dune and run its test suite
opam install some community packages
- iter, gen, qcheck-core
- fmt, logs, astring, rresult, jsonm, uucp, uuseg
- digestif, decompress, optint, ke, hxd, checkseum
- mpg123, curses

As a pedagogical tool

One thing I underestimated going in is the educational value of having an LLM translate code, especially this code, while you watch. Reading the C directly gets you a partial picture, you see what each function does and why it’s structured that way. Asking the model to translate it to another language seems mechanical enough, but the places where it stumbles or prompts a choice reveals things you might be missing.

Halfway through we were translating lf_skiplist.c, and I realized I’d never used a skip-list before in my life[3]. Reviewing it, I wondered why this wasn’t a tree? It would be a lot easier for Rust to represent it, that’s for certain. But then I realized the difference is the concurrent, lock-free part is actually doable with a skip-list in a way that would be extremely hard if it was a tree. This is a really elegant decision.

As an aside: if you’ve ever thumbed through the Algorithm Design Manual, perhaps fallen asleep at night with it on your chest, and lamented that, despite there being like 75+ algorithms catalogued, you only use the top 5, maybe consider a career in compilers.

Tying up the lock-free skip-lists part. There’s a bit of a meme about how hard it is to do linked lists in Rust because it maps very poorly to its lifetime model. The Rust port side-steps the borrow-checker completely here by preserving the raw pointer manipulations. It’s actually merely as hard as the C version if you useunsafe

A second, more interesting problem was the bytecode interpreter. It leverages computed gotos for a big performance win. Computed gotos are a non-standard extension to C compilers (supported in GCC and LLVM). I knew these existed but the extension is so niche I had also never used one in my career.

The reason computed gotos are fast for interpreter loops is not because they optimize jumping to handlers, but because they make the jump predictable to the branch predictor inside of the CPU. The Python interpreter had a rather big performance regression recently because they stepped on a toolchain footgun that altered the behavior of this heuristic. But, computed gotos are not representable in Rust stable. More on the workaround in the “rust-runtime-nightly” section below.

A last, more minor surprise to me, and a bit of an embarassment since I’ve been in OCaml land for so long, is that I had a mistaken impression of how runtime values are encoded. I knew OCaml’s runtime steals a bit from each value to flag whether it’s an immediate (an int63) or a heap pointer. But I had it backwards, I thought it stole the top bit and simply halved the address space it could address since on 32-bit platforms only 2GB is effectively usable for a heap anyway (the kernel reserves half), and on 64-bit platforms you won’t really suffer if you’re restricted to only 9 quintillion bytes. But it’s the opposite: it steals the bottom bit. If the bit is set to 0 it just knows it’s a pointer because all heap pointers are 4 or 8 byte aligned. Those have the bottom bit set to zero naturally. If the bottom bit is set to 1, it knows the top 63-bits are an integer, and it simply shifts it downwards. In the space of programming language GCs, this is a nuanced trade-off and each does it a little differently. In OCaml it means pointers require no decoding to use and immediates only take a shift to decode. In some cases math operations can be done to the encoded values directly, no shift required.

It’s worth noting that two of these, the computed gotos and the value encoding, are major reasons why the OCaml runtime can’t be optimally written in OCaml itself. The runtime has to live below the abstractions the language provides and that’s one reason to choose something lower level, like C. And why translating it into more idiomatic Rust is not trivial.

Inline assembly, for compatibility!(?)

Three features used by the C based runtime were only available in Rust nightly. Rust nightly can change on you at any time and shouldn’t be relied on, so you’re encouraged to stick to Rust stable if you want timeless software. Given the constraint, the LLM cheerfully offered to implement them in inline assembly to stay on Rust stable

Thread local storage: normally available in C compiler extensions. A form of this exists in the Rust standard library, but because we’re using no_std we had to either go with Rust nightly or roll or our.
Variadic functions: A basic feature in C. Rust can call into variadic functions but can’t define them without an extension from Rust nightly. The inline assembly version of this is not too horrible.
You cannot initialize a Rust static from the address of another Rust static, but this is needed to implement ephemerons in OCaml. C handles this with link-time address relocations. Rust const eval refuses. Inlined assembly solution: just mark some things .weak in an inline asm block and you’re done.

One feature in C has no Rust equivalent. setjmp/longjmp, or non-local gotos. The interpreter needs these for exception handling. Rust cannot safely call sigsetjmp and be called into by sigsetlongjmp. But you can make an inline assembly version! There’s a lot of concerns on doing this generally in Rust because of returns_twice, but Claude audited the two call sites that use it and decided they don’t rely on persisting local variables that the returns_twice presents clobbering.

Introducing inline assembly in this project isn’t too foreign, as the OCaml native compiler itself emits hand-tuned assembly itself.

There’s a `rust-runtime-nightly` branch, though

The Rust stable version of the C bytecode interpreter with the computed gotos was ported to a straight loop { match ... }. Performance suffered considerably! So, we decided to introduce Rust nightly in a separate branch to see if we could bring in explicit tail calls to resolve the loss of computed goto in the interpreter.

Benchmarking shows it is actually up to 5% faster than the C computed goto version. So, there’s that.

But why?

At the minimum it was a fun experiment, and that’s reason enough for me. But the thing that actually nags at me: the OCaml runtime is in C because of when it was written, not because anyone recently weighed C against the alternatives and picked it. Rust only appeared two decades after OCaml broke ground. That’s legacy, not an active decision. And the interesting thing about cheap mechanical rewrites is that they turn “it’s in C” from a settled fact back into a live question.

I am not claiming the answer is “therefore, rewrite it in Rust”

The mechanical translation is the easy, cheap part and this whole post is evidence of that. The expensive part is everything the rewrite doesn’t settle, and now that the mechanical cost is near zero, those considerations stop being footnotes and become the whole conversation.

What are the actual tradeoffs? What would it take to trust a mechanical rewrite as much as a hand-written one? If a human had done this exact port by hand over two years, would you trust it more? And if so, what exactly is that delta made of, and can it be closed?

Open questions! And more open than they were a year ago.

Where do we go from here?

I’m really excited this worked at all. I had an incredible amount of fun doing it and if this makes for at least an interesting coffee conversation for some people it is worth it alone.

I don’t seriously expect the OCaml team to switch to this port (it does not comply with their new AI policy, for one). But I would be a little surprised if this experience report caused no update for anyone in the world. Human steered AI rewrites are definitely something that should be part of a programmer’s tool kit, especially if there’s a mature test suite and you have an existing system to use as a source of truth. Even without infinite token budget like AI lab insiders might have, you can still get pretty far!

Future steps from here may include: looking for more opportunities to encapsulate the unsafe parts, verify more of the ecosystem (perhaps by back-porting to a more widely adopted OCaml release), and really see how good a job it did of not introducing new runtime bugs. More speculatively, we could try a toy OCaml fork that breaks compatibility to see how much a performance cost is incurred if we eliminate all unsafe.

Appendix A: Full benchmarks (aka “eh, just about parity, kinda”)

Take these benchmarks with a grain of salt.

The pure Rust stable runtime is at parity with native executable performance but has bigger worst case performance with the interpreter because of lack of computed goto.

With Rust nightly, adding the become keyword for explicit tail calls (ETCs) in lieu of computed goto, interpreter performance is at parity if not faster. One should note that the Rust docs say explicit_tail_calls are “currently incomplete and may not work properly”.

One caveat! Rust shows more win on benchmarks that hammer the GC slow paths, and we wondered if that was just build flags rather than “Rust, the language”. The Rust runtime is one crate built with -O3 and link-time optimization (LTO), while upstream C is -O2 and separate translation units (TUs), no LTO. Upstream calls -O3 “somewhat risky”. For the sake of trying to be more apples-to-apples, we rebuilt trunk’s C runtime with -O3 -flto and re-ran everything. Result: flags explain about a third of the lists win, none of the win in finalise, and on bytecode there’s actually a 9% regression; cross-TU inlining bloats the interpreter loop. Upstream’s default build turns out to be the C runtimes best configuration, which makes it a better baseline.

The results below are upstream’s default flags vs Rust with LTO and -O3.

Again, take these benchmarks with a grain of salt. IMO the takeaway from this is “eh, about parity, kinda”, not “Rust gives a significant boost! We must switch right now!”

Given how posted benchmarks usually go, you should expect the fifth comment down to explain why my benchmarks are completely wrong. Mentally fill it in if it’s missing.

OCaml native and byte-code, no flambda

=== [stock] WALL CLOCK (min ms, lower is better) ===
                        trunk                     rust-runtime              rust-runtime-nightly
native
  almabench             782.61 ms                 780.61 ms (0.997x)        788.07 ms (1.007x)
  bdd                   1553.51 ms                1531.07 ms (0.986x)       1530.86 ms (0.985x)
  crout_decomposition   460.74 ms                 463.42 ms (1.006x)        473.57 ms (1.028x)
  durand_kerner_aberth  569.35 ms                 560.39 ms (0.984x)        567.75 ms (0.997x)
  kb                    1272.94 ms                1269.73 ms (0.997x)       1245.44 ms (0.978x)
  soli                  2798.39 ms                2850.08 ms (1.018x)       2825.36 ms (1.010x)
  binarytrees5          1969.36 ms                1936.12 ms (0.983x)       2044.22 ms (1.038x)
  alloc                 5916.50 ms                6654.26 ms (1.125x)       6612.67 ms (1.118x)
  lists                 204.67 ms                 177.14 ms (0.865x)        178.34 ms (0.871x)
  finalise              351.60 ms                 326.26 ms (0.928x)        326.49 ms (0.929x)
  TOTAL                 15879.67 ms               16549.09 ms (1.042x)      16592.80 ms (1.045x)
bytecode
  almabench             5201.01 ms                4754.61 ms (0.914x)       4143.59 ms (0.797x)
  bdd                   26813.52 ms               36608.96 ms (1.365x)      25355.17 ms (0.946x)
  crout_decomposition   6862.29 ms                7705.78 ms (1.123x)       6121.15 ms (0.892x)
  durand_kerner_aberth  12174.51 ms               14244.28 ms (1.170x)      12766.48 ms (1.049x)
  kb                    8106.58 ms                11773.33 ms (1.452x)      8507.30 ms (1.049x)
  soli                  89943.46 ms               112959.91 ms (1.256x)     74206.08 ms (0.825x)
  binarytrees5          8493.53 ms                13385.23 ms (1.576x)      8839.91 ms (1.041x)
  alloc                 70911.11 ms               128335.46 ms (1.810x)     69083.19 ms (0.974x)
  lists                 311.60 ms                 342.14 ms (1.098x)        310.50 ms (0.996x)
  finalise              1448.42 ms                1917.89 ms (1.324x)       1327.77 ms (0.917x)
  TOTAL                 230266.01 ms              332027.59 ms (1.442x)     210661.13 ms (0.915x)

=== [stock] INSTRUCTIONS (min retired, lower is better) ===
                        trunk                     rust-runtime              rust-runtime-nightly
native
  almabench             11,982,525,444            12,045,767,816 (1.005x)   12,119,109,811 (1.011x)
  bdd                   23,195,880,693            23,672,243,713 (1.021x)   23,496,447,226 (1.013x)
  crout_decomposition   10,540,697,462            10,528,518,080 (0.999x)   10,529,006,913 (0.999x)
  durand_kerner_aberth  9,925,393,139             9,931,601,603 (1.001x)    9,933,069,944 (1.001x)
  kb                    16,180,752,516            16,273,026,843 (1.006x)   16,446,171,810 (1.016x)
  soli                  80,190,717,599            80,192,319,756 (1.000x)   80,192,251,301 (1.000x)
  binarytrees5          36,203,768,664            37,582,691,016 (1.038x)   39,510,055,070 (1.091x)
  alloc                 130,567,325,652           130,484,776,522 (0.999x)  130,485,472,014 (0.999x)
  lists                 4,080,393,283             3,694,861,446 (0.906x)    3,694,934,206 (0.906x)
  finalise              6,795,543,488             6,445,368,456 (0.948x)    6,339,747,723 (0.933x)
  TOTAL                 329,662,997,940           330,851,175,251 (1.004x)  332,746,266,018 (1.009x)
bytecode
  almabench             65,289,519,768            81,655,602,353 (1.251x)   75,567,878,905 (1.157x)
  bdd                   322,229,228,258           511,376,407,765 (1.587x)  351,643,764,781 (1.091x)
  crout_decomposition   89,704,774,045            126,778,953,728 (1.413x)  109,322,946,879 (1.219x)
  durand_kerner_aberth  174,081,597,005           260,836,571,057 (1.498x)  251,548,510,858 (1.445x)
  kb                    113,945,898,760           179,633,203,899 (1.576x)  136,450,793,255 (1.198x)
  soli                  1,069,822,905,043         1,640,541,482,411 (1.533x) 1,073,678,761,274 (1.004x)
  binarytrees5          130,470,446,317           195,276,263,749 (1.497x)  158,298,506,061 (1.213x)
  alloc                 1,282,744,491,682         2,064,392,411,405 (1.609x) 1,612,854,371,669 (1.257x)
  lists                 6,153,104,169             6,883,900,743 (1.119x)    6,582,283,553 (1.070x)
  finalise              23,288,098,487            31,858,883,756 (1.368x)   25,418,880,165 (1.091x)
  TOTAL                 3,277,730,063,534         5,099,233,680,866 (1.556x) 3,801,366,697,400 (1.160x)

=== [stock] CYCLES (min CPU cycles, lower is better) ===
                        trunk                     rust-runtime              rust-runtime-nightly
native
  almabench             4,275,133,794             4,266,969,984 (0.998x)    4,306,909,109 (1.007x)
  bdd                   8,423,078,413             8,336,325,514 (0.990x)    8,299,324,772 (0.985x)
  crout_decomposition   2,436,557,980             2,452,598,643 (1.007x)    2,499,609,328 (1.026x)
  durand_kerner_aberth  3,069,143,209             3,026,946,007 (0.986x)    3,061,589,426 (0.998x)
  kb                    6,910,707,438             6,896,853,517 (0.998x)    6,769,334,136 (0.980x)
  soli                  14,973,590,917            15,301,024,579 (1.022x)   15,129,903,373 (1.010x)
  binarytrees5          10,631,483,033            10,480,155,164 (0.986x)   11,050,267,510 (1.039x)
  alloc                 31,649,853,386            35,786,537,383 (1.131x)   35,671,107,350 (1.127x)
  lists                 1,102,518,006             953,712,673 (0.865x)      955,703,816 (0.867x)
  finalise              1,878,606,838             1,754,635,089 (0.934x)    1,750,940,930 (0.932x)
  TOTAL                 85,350,673,014            89,255,758,553 (1.046x)   89,494,689,750 (1.049x)
bytecode
  almabench             28,150,819,846            25,755,551,403 (0.915x)   22,276,857,902 (0.791x)
  bdd                   146,021,329,177           200,336,983,620 (1.372x)  138,615,168,879 (0.949x)
  crout_decomposition   37,529,521,884            41,904,393,924 (1.117x)   33,079,620,789 (0.881x)
  durand_kerner_aberth  66,297,703,676            77,048,128,983 (1.162x)   68,526,496,042 (1.034x)
  kb                    44,399,403,914            64,233,052,062 (1.447x)   46,199,410,939 (1.041x)
  soli                  495,570,241,232           617,323,022,780 (1.246x)  406,057,473,311 (0.819x)
  binarytrees5          46,566,265,069            72,906,173,460 (1.566x)   47,896,412,861 (1.029x)
  alloc                 386,859,994,733           698,738,688,990 (1.806x)  371,099,084,024 (0.959x)
  lists                 1,687,035,843             1,840,186,059 (1.091x)    1,673,067,897 (0.992x)
  finalise              7,894,225,414             10,423,868,878 (1.320x)   7,145,578,643 (0.905x)
  TOTAL                 1,260,976,540,788         1,810,510,050,159 (1.436x) 1,142,569,171,287 (0.906x)

OCaml native-only with flambda

=== [flambda] WALL CLOCK (min ms, lower is better) ===
                        trunk                     rust-runtime              rust-runtime-nightly
native
  almabench             788.42 ms                 778.92 ms (0.988x)        776.47 ms (0.985x)
  bdd                   1701.32 ms                1713.61 ms (1.007x)       1706.10 ms (1.003x)
  crout_decomposition   356.73 ms                 361.88 ms (1.014x)        354.48 ms (0.994x)
  durand_kerner_aberth  560.64 ms                 566.03 ms (1.010x)        562.88 ms (1.004x)
  kb                    1286.42 ms                1277.64 ms (0.993x)       1254.64 ms (0.975x)
  soli                  2846.62 ms                2835.48 ms (0.996x)       2826.60 ms (0.993x)
  binarytrees5          1974.28 ms                2083.74 ms (1.055x)       1985.55 ms (1.006x)
  alloc                 6614.50 ms                6621.29 ms (1.001x)       5895.50 ms (0.891x)
  lists                 201.42 ms                 179.15 ms (0.889x)        180.03 ms (0.894x)
  finalise              333.90 ms                 316.36 ms (0.947x)        312.57 ms (0.936x)
  TOTAL                 16664.24 ms               16734.09 ms (1.004x)      15854.81 ms (0.951x)

=== [flambda] INSTRUCTIONS (min retired, lower is better) ===
                        trunk                     rust-runtime              rust-runtime-nightly
native
  almabench             12,034,850,839            12,098,421,833 (1.005x)   12,172,354,394 (1.011x)
  bdd                   27,423,697,883            27,857,846,511 (1.016x)   27,756,041,045 (1.012x)
  crout_decomposition   7,427,485,867             7,414,807,272 (0.998x)    7,414,600,792 (0.998x)
  durand_kerner_aberth  10,049,245,201            10,055,781,084 (1.001x)   10,056,665,095 (1.001x)
  kb                    16,031,096,938            16,117,382,051 (1.005x)   16,319,595,413 (1.018x)
  soli                  80,312,687,759            80,310,809,241 (1.000x)   80,313,394,595 (1.000x)
  binarytrees5          35,947,315,834            37,275,057,576 (1.037x)   39,466,919,214 (1.098x)
  alloc                 130,580,050,276           130,480,782,692 (0.999x)  130,470,459,850 (0.999x)
  lists                 4,060,601,508             3,674,872,614 (0.905x)    3,675,307,359 (0.905x)
  finalise              6,527,081,899             6,188,708,550 (0.948x)    6,066,414,273 (0.929x)
  TOTAL                 330,394,114,004           331,474,469,424 (1.003x)  333,711,752,030 (1.010x)

=== [flambda] CYCLES (min CPU cycles, lower is better) ===
                        trunk                     rust-runtime              rust-runtime-nightly
native
  almabench             4,310,869,190             4,271,067,276 (0.991x)    4,273,574,788 (0.991x)
  bdd                   9,216,952,701             9,282,752,592 (1.007x)    9,249,625,080 (1.004x)
  crout_decomposition   1,892,517,849             1,921,876,466 (1.016x)    1,889,816,433 (0.999x)
  durand_kerner_aberth  3,033,087,434             3,058,715,626 (1.008x)    3,050,420,028 (1.006x)
  kb                    7,011,708,143             6,956,097,326 (0.992x)    6,830,973,136 (0.974x)
  soli                  15,289,634,946            15,222,697,901 (0.996x)   15,185,371,100 (0.993x)
  binarytrees5          10,696,099,880            11,287,126,205 (1.055x)   10,758,391,992 (1.006x)
  alloc                 35,697,684,634            35,701,748,838 (1.000x)   31,670,157,017 (0.887x)
  lists                 1,085,485,730             959,634,997 (0.884x)      970,192,157 (0.894x)
  finalise              1,797,386,524             1,694,521,695 (0.943x)    1,683,679,531 (0.937x)
  TOTAL                 90,031,427,031            90,356,238,922 (1.004x)   85,562,201,262 (0.950x)

Alright, that’s it. Thanks for reading.

Footnotes

Very online joke re: this comment on the vibe-coded OCaml DWARF PR.
Most work was done with Claude Opus 4.7. Opus 4.8 was released when we were almost done. The tail-end of the work was done with that.
Although, thanks to frequency illusion, now that I’ve read about it, I’m starting to notice skiplists in more places. Do see the associated paper!

That’s very interesting! I’d like to try it, but the https://github.com/mbacarella/rustcaml link is giving me a 404 error. Maybe the repository is currently private?

I wonder if your performance analysis could also suggest ways to speedup the C runtime a bit. (For example: maybe you could play with explicit tail-calls as a replacement for computed gotos in C, and see how that works. Or maybe looking at the parts of the runtime that are faster in Rust could also reveal low-hanging optimization potential on the C side.)

Regarding the choice of language, I have wondered on the occasion whether one should consider rewriting the OCaml runtime in something else than C. My experience as a non-C-expert hacking on the OCaml runtime is that it is actually reasonable most of the time (it is a problem domain where being forced to do things very simply is not bad). I would see the following benefits from moving to another language:

Some small/local niceties, such as the ability to return several things from a function. (I wouldn’t mind proper algebraic datatypes either.)
Easier access to third-party data structure implementations (currently we have to reimplement everything internally; we don’t have, for example, a concurrent hash table, because it is work to implement, even though it would occasionally be a good data-structure for some things), better genericity (reusing a data structure at several different types).
The availability of formal-methods tools for very advanced testing (say model-checking) or even formal verification of parts of the runtime. My impression is that we should try to follow best practices, including adopting those methodologies. But: in fact C has equal or better tooling on this front than most languages (say symbolic execution, model-checking, static analysis, deductive verification: C has more mature tools than other systems languages.)
More generally I believe in trying to use the best available technology, and it is plausible than there are better systems programming languages than C these days.

On the other hand, the following could cause issues:

We probably want a language that is portable, widely available. I understand that there has been some pushback against Rust adoption (was it in python-crypto?) due to this. But maybe the cat is already out of the bag and in practice there may be an ongoing convergence between “architectures that are not completely dead” and “architectures on which Rust can run”.
I am mindful of the dependency cone, and I don’t think that it is wise to put LLVM as a strict dependency of OCaml, as we would if using Rust. (There is cranelift, but it is unclear that it will ever become a realistic alternative for performance-conscious usage.) Zig seems to be doing a bit better in rooting out LLVM as a hard dependency.
We don’t have the maintenance power to follow a language that breaks compatibility regularly, so I don’t know that Zig would be a reasonable choice.

Obviously C is missing on the ADT front, but to me it is one of the languages which makes it easiest to return several things from a function. Although of course I may not be seeing things in the same way as you are: I consider that OCaml can only return one thing from its functions, even though this thing may be a tuple or a record, while I consider C’s ability to return structs (not pointers) the real deal, returning several values without boxing them in some way. (The limitation on the OCaml side is not that strong; OxCaml removes it, and Flambda2 can sidestep it in some cases without the extra features of OxCaml.)

It seems https://github.com/mbacarella/rustcaml leads to a 404?

You are thinking from an efficiency perspective, but the hassle of having to declare a struct is too much for me most of the time, so instead I use out-variables (which are also slightly annoying to use in C) with some decision fatigue on what to return. I would prefer to return a tuple or an anonymous structure.

There’s always the option of declaring a struct inline:

struct point { int x, y; } make_point (int x, int y)
{
  return  (struct point){ .x = x, .y =y };
}

SPARK/Ada fits those preferences. (And it is also my belief that it is a better systems programming language than C.) There is also currently a translation between SPARK and OCaml via Why3, and SPARK’s formal verification tool (GNATprove) is written in OCaml.

Ada has an ISO standard specification, and the governing body, The Ada Rapporteur Group, is extremely reluctant to make any backward incompatible changes. In the past when there were minor incompatible changes, compilers supported the earlier version of the standard via command line options. I’ve seen reports of thirty or forty year old code compiled with new compilers, even on different hardware, with only minor changes (such as compiling software originally written on a Vax/VMS system). There are several compilers (though most are non-free). The main free compiler, GNAT, is part of the GNU compiler collection so doesn’t rely on LLVM. The commercial compiler, ObjectAda, from PTC, is currently offered free to open source projects. Another free compiler, HAC, is in development but is far from implementing the full specification yet.

Like with OCaml the strict typing discipline of Ada can save one from lots of errors even without SPARK, and the compiler-writers of GNAT have been increasing including the ability to warn statically of more and more things via static analyses built in to the compiler.

Ada does not have multiple return values. One can either declare a record type and return a record or have multiple out parameters. An out parameter cannot be read in the procedure body, only written to, and it abstracts from the implementation: One does not have to use pointers for that as one would in C.

Ada does not have ADTs but it comes close with record and enumeration types. Ada does not have pattern matching, but the GNAT compiler has experimental “structural pattern matching” (I believe with exhaustiveness checking). Unfortunately, it doesn’t seem close to standardization, and the GNAT compiler is the only one currently implementing the latest version of the standard anyway.

The latest version of the standard contains lambda expressions and filter-map-reduce expressions.

The package system of Ada reminds me of ML modules, and private types remind me of abstract types. Ada Generics remind me of modules together with functors.

I’m not sure what you had in mind. I searched for concurrent hash tables for Rust, and I saw that there are two approaches, sharding and lock-free. I’m not aware of any lock-free hash table implementations for Ada. Ada has built-in concurrency primitives (since Ada '83, the first standardized version), and the sharding approach seems like it would be easy to implement.

Note that tail-calls are not a replacement for computed gotos. It is tall-calls + a very adhoc ABI.

Indeed, if you are using the standard calling convention, then some registers are marked as callee-saved, which means that they will have to be saved and restored around every bytecode instruction. Moreover, the bytecode interpreter will need to pass a bunch of context through function arguments (or use an indirection, which is just as bad), which is more than your ABI is likely to support for register passing. Both reasons explain why you need a special calling convention, which disables all the callee-saved registers and which passes as many function arguments as possible by register.

That is precisely what __attribute__((preserve_none)) was designed for. But you will need a very recent GCC compiler for this attribute. (For LLVM, it has been there for a bit longer.)

Coming back to the original subject, I recently stumbled (again) upon this paper: https://dl.acm.org/doi/10.1145/3737696. It dates from November, and explores the challenges of using automated tools (including LLMs) to translate C code to Rust. I would be curious to know if the work presented in this thread was influenced by this paper or other public works on the subject, and if not, if knowing about it would have changed the way this translation of the OCaml runtime was done.

Yes accidentally still had it private. Should have been fixed.

I was not aware of this paper before beginning, though I was aware of C2Rust.

I approached this differently than what they would prescribe. I wanted to do a non-idiomatic, line-by-line port first in the interest of not bending the code out of shape from the upstream OCaml. My concern is it would be easier to follow upstream that way. Though having sat with this a bit, and iterated on other projects, and seeing the life cycle of testing and also watching LLMs simply generate ad hoc theories and testing them against the upstream OCaml as an oracle, it seems like we could now be more ambitious and convert a lot more of the C-like-Rust into idiomatic Rust while maintaining semantics (relatable toy example: look for opportunities to make the lock-free skiplist idiomatic Rust, or find a safe concurrent data structure).

But! This runtime is not really like the median C project. A lot of the unsafety is unavoidable because of the runtime’s value heap-pointer/long-value duality. To encapsulate the unsafety we’d need to turn each value into something like enum Value { Long_val(i63), Block_pointer(...) } and having to destructure that every time it’s inspected would be taxing.

I think knowing how this project would have turned out two weeks after starting it would have changed my approach towards maybe actually telling Claude to just greenfield a new runtime in Very Proper Rust and grind away at it until it all tests we can think of pass. It would have taken longer but been more interesting.

ETCs aside, there is some apparent (but small) performance improvement in the Rust version along the GC slow path that I haven’t pinned down. You kind of need benchmarks that specifically grind the GC to see it though.

It would be interesting to tease out what it is, exactly, and see if the C version can do it too as -O3 and -flto does not close the gap.

Lobsters · 28 赞 · 3 评讨论 → 阅读原文 →

这条对你有帮助吗？

A line-by-line translation of the OCaml runtime from C to Rust

A line-by-line translation of the OCaml runtime from C to Rust

More seriously

Wait, isn’t OCaml written in OCaml?

A word on methodology

Benchmarks

2015 unsafes

More musings

Claude’s carefulness discipline was good but not great

Tests were done

As a pedagogical tool

Inline assembly, for compatibility!(?)

There’s a rust-runtime-nightly branch, though

But why?

Where do we go from here?

Appendix A: Full benchmarks (aka “eh, just about parity, kinda”)

OCaml native and byte-code, no flambda

OCaml native-only with flambda

Footnotes

2015 `unsafe`s

There’s a `rust-runtime-nightly` branch, though