Astro Hacker News - Performance improvements in libffi

Koromix |next [-]

Interesting, it's similar to what I've done recently in Koffi 3, which is an FFI package for Node.js. I wrote my own C FFI layer rather than using libffi. This works great, and got me close to the performance of statically implemented NAPI modules.

Benchmarks are here: https://koffi.dev/benchmarks

It's still a little different because in my case, the instructions tend to do two things: decode the JS value and prepare the register/stack. For typical functions, only a few instructions have to run, with minimal overhead. So, for example, I have a PushBool instruction which calls napi_get_value_bool() and then puts the bool at the correct (pre-computed) offset so that it ends up in a register or on the stack.

A function like int atoi(const char *) ends up with only two bytecode instructions:

  - PushString
  - RunInt32 (combined macro-operation that defers to assembly to set up registers, call the function, and then directly decodes the value)

Or another example, void *memset(void *ptr, int value, size_t size) only needs four instructions:

  - PushPointer
  - PushInt32
  - PushUInt64
  - RunPointer

I've coupled that with a tail-call direct-threaded interpreter, with Clang's __attribute__((preserve_none)) ABI, just like Python did recently: https://github.com/python/cpython/issues/128563
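
The instruction lists above can be sketched in a few lines of Python. This is only an illustration of the compile-once, run-per-call idea, not Koffi's actual implementation: the opcode constants, compile_plan(), execute() and fake_atoi() are hypothetical names, the slot index stands in for the pre-computed register/stack offset, and a plain Python function stands in for the native call.

```python
# Hypothetical opcodes, named after the instructions mentioned above.
PUSH_STRING, PUSH_POINTER, PUSH_INT32, PUSH_UINT64, RUN_INT32, RUN_POINTER = range(6)

PUSH_OPS = {"string": PUSH_STRING, "pointer": PUSH_POINTER,
            "int32": PUSH_INT32, "uint64": PUSH_UINT64}
RUN_OPS = {"int32": RUN_INT32, "pointer": RUN_POINTER}


def to_int32(value):
    """Wrap a Python int to a signed 32-bit value, as a C int would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def compile_plan(arg_types, ret_type):
    """Run once per signature: one push per argument, then one combined run op.

    The slot index stands in for the pre-computed register/stack offset.
    """
    plan = [(PUSH_OPS[t], slot) for slot, t in enumerate(arg_types)]
    plan.append((RUN_OPS[ret_type], None))
    return plan


def execute(plan, target, args):
    """Run per call: decode each argument into its slot, call, decode the result."""
    slots = [None] * (len(plan) - 1)
    for (op, slot), value in zip(plan[:-1], args):
        if op == PUSH_STRING:
            slots[slot] = value.encode("utf-8") + b"\0"  # NUL-terminated copy
        elif op == PUSH_INT32:
            slots[slot] = to_int32(value)
        elif op == PUSH_UINT64:
            slots[slot] = value & 0xFFFFFFFFFFFFFFFF
        elif op == PUSH_POINTER:
            slots[slot] = value  # opaque handle in this sketch
    run_op, _ = plan[-1]
    result = target(*slots)  # stand-in for the assembly call stub
    return to_int32(result) if run_op == RUN_INT32 else result


def fake_atoi(c_string):
    """Pure-Python stand-in for the native atoi()."""
    return int(c_string.rstrip(b"\0"))


atoi_plan = compile_plan(["string"], "int32")
memset_plan = compile_plan(["pointer", "int32", "uint64"], "pointer")
```

With these definitions, execute(atoi_plan, fake_atoi, ["42"]) runs exactly two instructions, and memset_plan holds four. The per-call loop never looks at type names again; all of that was resolved in compile_plan().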

quotemstr |next |previous [-]

Bytecode is an awesome trick and gets used in a surprising number of situations. In Windows COM, for example, stubs and proxies (used for IPC and serialization) do their marshaling by interpreting a small bytecode generated from type and signature descriptions. You end up with an artifact that is smaller and more convenient than AOT-compiled native code, and it doesn't hurt performance in any practical way.

Notably, the COM bytecode covers not only procedure-level argument-passing, but data structure transformations themselves. It's a nice setup.

menaerus |next |previous [-]

Quite an unlucky CPU to run the experiments on. The article doesn't mention it, but I hope the numbers OP got were measured by re-running the experiment on the same type of core each time. The Intel Core Ultra 7 255H is a mix of performance (6x), efficiency (8x) and low-power (2x) cores.
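
One way to keep every run on the same core is to pin the benchmark process before timing anything. A minimal sketch, assuming Linux (os.sched_setaffinity is not available on other platforms); pin_to_one_cpu() and best_of() are hypothetical helper names, and which CPU id maps to which core type has to be checked on the machine itself.

```python
import os
import time


def pin_to_one_cpu():
    """Restrict this process to a single CPU it is already allowed to run on.

    Returns the chosen CPU id, or None where the platform has no affinity API.
    """
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return None
    cpu = min(os.sched_getaffinity(0))  # assumption: any fixed core will do
    os.sched_setaffinity(0, {cpu})
    return cpu


def best_of(fn, repeats=5, loops=100_000):
    """Return the fastest per-call time in seconds over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        best = min(best, (time.perf_counter() - start) / loops)
    return best
```

Calling pin_to_one_cpu() once at startup and then comparing best_of() results at least rules out the scheduler moving the process between core types mid-measurement.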

fock |root |parent [-]

Well, Claude likely isn't really trained on benchmarking across such systems...

rurban |next |previous [-]

Oh, I thought he did this already. Why was there a prepare step when it doesn't prepare the arg decoding?

atgreen |root |parent [-]

TBH, the complexity of this step grew over time, and the overhead snuck up on us. The prep step does useful work (e.g., determining stack space requirements). It's just that we don't have to do it again on every call.

Something I should have mentioned is that we could have avoided the new APIs if only there were space in the ffi_cif to stash a plan pointer, and I didn't want to break ABIs for this.

tadfisher |previous [-]

Can we AOT-compile stubs instead of interpreting or JIT-compiling? I feel like most FFI users would call static, well-defined functions.

atgreen |root |parent |next [-]

Yes, that's part of what was done here. So, create a plan, and then for some subset of plans, create AOT-compiled templates. The analogies are:

  a) the original implementation is like interpreting by walking a syntax tree
  b) building/caching an execution plan is like interpreting by executing bytecode generated from the syntax tree
  c) using an AOT-compiled template is like execution from QEMU's old TCG template system

But we only do (c) for a popular subset of function signatures. The biggest win was (b), but (c) is still an improvement over (b).

quotemstr |root |parent |previous [-]

I'd measure twice before cutting. Almost everyone not deep into cross-language interop and VM design intuits, incorrectly, that FFI mechanisms themselves drive interop costs. In practice, it's almost never the case. While, in principle, compiling a libffi signature to native code could be a win, doing so matters a lot less often than you think.

Keep in mind that optimizing the call doesn't optimize the marshaling: even with an AOT-compiled FFI trampoline, if you're, say, sending a string from one place to another, you usually need to transform the string in some manner (copy it, change encoding, add/remove length prefixes, etc.) and JITing the libffi parameter passing won't help you do the string stuff any faster.
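
To make the string point concrete, here is a minimal Python sketch of two common marshaling steps. Both helper names are hypothetical, and the length-prefixed UTF-16 layout is an assumed wire format chosen only for illustration. Each helper copies and re-encodes the whole string, which is work no call trampoline can remove.

```python
import struct


def to_c_string(text):
    """Copy, re-encode as UTF-8 and NUL-terminate, as a `const char *` needs."""
    return text.encode("utf-8") + b"\0"


def to_length_prefixed_utf16(text):
    """Copy, re-encode as UTF-16LE and prepend a 32-bit byte length.

    The layout is an assumption made for this example.
    """
    payload = text.encode("utf-16-le")
    return struct.pack("<I", len(payload)) + payload
```

Both functions are linear in the string length, so for anything but tiny strings their cost grows while the cost of the call itself stays constant.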

In fact, trying to AOT the connections can make your program worse, both by bloating it (adding cache pressure, likely small but still real) and by complicating your build and deployment process.

libffi bytecode is good. I wouldn't bother with native code until I had a profile in hand showing the bytecode to be the bottleneck, and even then, I'd check it three or four times to make sure I didn't get the profiling wrong. FFI is just seldom the problem in real-world systems.