Hacker News
NUMA: Cores, memory, and the distance between them
gopalv
Was that a Power9 or some sort of IBM machine?
Not all NUMA is the same; ccNUMA from Intel is a different beast from the PPC version of the same.
treesknees
We have to explain this to customers of our software all the time; it’s something that’s easy to miss.
alexzenla
Part 2 is going to cover how we actually solved it, which involves every part of the system having knowledge of the NUMA layout. It's so easy to ignore, but it has a massive impact on perf.
Twirrim
It has been a source of routine conversations with customers and engineers of all kinds, and often one of those things you don't know about until it's too late.
I don't know if the kernel has improved this behaviour in the several years since it was last tested, but a coworker realised that the Linux page cache wasn't fully split by NUMA node. They were benchmarking MySQL, running it in each NUMA node, and noticed the second NUMA node was noticeably slower. Then they discovered that after a reboot the second node was fast and the first was slower. After a bit of thinking and tinkering they discovered that libmysql was ending up in the page cache on whichever NUMA node the benchmark client was run in first. So even though they were pinning the benchmark tool and the mysql process to the NUMA node, the benchmark client was causing the OS to reach across to the other NUMA node to get at the page-cached library.
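One way to check for this kind of placement is to look at `/proc/<pid>/numa_maps`, which lists per-node page counts (`N0=`, `N1=`, ...) for each mapping of a process. The following is a minimal sketch, not the method the coworker used: the function name `pages_per_node`, the sample text, and the library path in it are all made up for illustration. It only counts pages the process has mapped, so it says nothing about page-cache pages that are not mapped.

```python
import re
from collections import Counter

# Matches per-node page-count fields such as "N0=290" in a numa_maps line.
NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text, path_substring):
    """Sum mapped pages per NUMA node for file-backed mappings whose
    path contains path_substring. Input is the text of a numa_maps file."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        fields = line.split()
        # fields[0] is the start address, fields[1] the memory policy.
        file_fields = [f for f in fields[2:] if f.startswith("file=")]
        if not file_fields or path_substring not in file_fields[0]:
            continue
        for field in fields[2:]:
            match = NODE_FIELD.match(field)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


# Hypothetical sample in the numa_maps format (not real measurements).
SAMPLE = """\
7f10a0000000 default file=/usr/lib/libexample.so.1 mapped=300 mapmax=2 N0=290 N1=10 kernelpagesize_kB=4
7f10a0400000 default file=/usr/lib/libexample.so.1 mapped=50 N0=50 kernelpagesize_kB=4
7f10b0000000 default anon=1200 dirty=1200 N1=1200 kernelpagesize_kB=4
"""

if __name__ == "__main__":
    print(pages_per_node(SAMPLE, "libexample"))
```

On a live system you would pass the contents of `/proc/<pid>/numa_maps` for the pinned process. If most of a library's pages sit on a node other than the one the process is pinned to, that matches the cross-node access described above.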
mickeyp
> If it does not, I cannot consider it ready for this century.
Mhmm.
iofiiiiiiiii
I have been dealing with the topic for a few years now and it was surprisingly hard to track down the bottlenecks to actual numbers. Some time ago I managed to find a good example to demonstrate the effect in a tangible way and wrote up an article about it. If the topic sounds interesting, you might enjoy https://sander.saares.eu/2025/03/31/structural-changes-for-4... (Structural changes for +48-89% throughput in a Rust web service).
jeffbee
Amazon gets this. Except for the 4th generation, their Graviton systems are not NUMA.
toast0
Single socket doesn't necessarily get you away from NUMA anyway. AMD server sockets are 4-way NUMA (you can set it for interleaving, but you could do better with NUMA-aware software), and I think Intel is doing NUMA within a server socket as well.
A lot of people like to take one big machine and partition it into several smaller virtual machines. In that case, it shouldn't be too hard to partition VMs into NUMA zones? Only VMs that are too big to fit in one zone (or that need to be repacked into a different zone) have to worry about it.
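To make the partitioning idea concrete, here is a rough sketch of a first-fit, largest-first placement of VMs into NUMA zones. It is an illustration only: `place_vms`, the VM names, and the node sizes are all hypothetical, and a real hypervisor scheduler has many more constraints. A VM that fits in no single zone is marked `None`, meaning it would have to span zones.

```python
def place_vms(vms, node_cores, node_mem_gb, node_count):
    """Assign each VM to one NUMA node using first-fit, largest VM first.

    vms maps a VM name to (cores, mem_gb). Returns a dict mapping each
    name to a node index, or None if no single node can hold the VM.
    """
    # Remaining [cores, mem_gb] per node; all nodes assumed identical.
    free = [[node_cores, node_mem_gb] for _ in range(node_count)]
    placement = {}
    largest_first = sorted(vms.items(), key=lambda kv: kv[1], reverse=True)
    for name, (cores, mem_gb) in largest_first:
        placement[name] = None
        for node, (free_cores, free_mem) in enumerate(free):
            if cores <= free_cores and mem_gb <= free_mem:
                free[node][0] -= cores
                free[node][1] -= mem_gb
                placement[name] = node
                break
    return placement


if __name__ == "__main__":
    # Hypothetical host: 4 NUMA nodes, each with 16 cores and 64 GB.
    demo_vms = {
        "big": (24, 96),  # larger than any single node
        "a": (16, 32),
        "b": (8, 32),
        "c": (8, 32),
        "d": (4, 16),
    }
    print(place_vms(demo_vms, node_cores=16, node_mem_gb=64, node_count=4))
```

In this example only `big` ends up unplaced, which is the case where the VM itself would need to be NUMA-aware.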
jeffbee
I think you can over-analyze this stuff and lose your sanity. On these multicore systems there are also hot cores in the center of the mesh and cold ones at the edges and theoretically you could be doing temperature-aware scheduling, gaining a bit more efficiency in doing so. But it's just easier to adopt the black box model of spherical frictionless CPUs.