Hacker News

About memory pressure, lock contention, and Data-oriented Design

57 points by vinhnx ago | 9 comments

hinkley |next [-]

> Legend says it only happens in obscure, low-level systems, but I'm here to refute the legend.

In a somewhat self-deprecating, but also an “anyone can do this if they have the will” fashion, I tend to say my real skill in performance tuning is stubbornness. Orphaned gains can easily account for 1/2 of the runtime of your average operation, sometimes as much as 2/3. The first 20% is fairly straightforward; the rest is just hard work. There are always four smartasses willing to misquote a Knuth aphorism they never understood, but the fact is if your customers hate how slow the product is, and more to the point, if your marketing people are talking about it, it’s time to get off your asses and do the hard work, because your shit is broken.

Yes, there’s specialized knowledge, but is it really any harder than understanding how databases work? Probably not. But if what is described in the quote above is a “legend”, then either things have gotten a lot worse while I wasn’t looking, or I’m underplaying the value of those skills in the overall mix, or both.

The biggest memory pressure problem I ever fixed: the code ran the same DB query twice and intersected two filtered views of that data, and I deduped the query. I was expecting about a 2x improvement, but instead I got 10x. In retrospect I should have hoped for maybe as high as 4x due to the extra short circuiting on the data set, but not 10x. The rest was all better data locality.

The irony of that situation is that I had been pulled in as a fixer and asked if I could figure out how to get a request taking thirty seconds down to three. This was supposed to be my opening salvo in a several week drill down and a series of improvements going from 15 to 10 to 8 to 7 seconds etc and prepping to apologize when I was only able to get it down to five, maybe six seconds. And instead I was done in like two days. But it did serve my larger thesis that we were using caching wrong. Data passed on the call stack doesn’t have to be looked up a second time. Or a third, or a fifth.

asQuirreL |next |previous [-]

Post can be summarised quite succinctly:

Everything was slow because sorting was taking a lot of time. Sorting was slow because its comparator was taking ~6 read locks on every comparison, and was cloning large structures to avoid holding the lock for a long time. The first fix was to access just the information needed to avoid the clones, the second fix was to cache exactly the data needed for sorting after the underlying data was updated, and use that for the comparators without needing to take the underlying lock.
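The second fix is essentially decorate-sort-undecorate. A minimal Rust sketch of the pattern (the `Row`/`RowData` types and `priority` field are hypothetical illustrations, not from the post): take each read lock once to snapshot the sort key, then run the O(n log n) comparison phase on plain values, lock-free:

```rust
use std::sync::RwLock;

// Hypothetical shapes: the "real" row data sits behind a lock.
pub struct Row {
    pub inner: RwLock<RowData>,
}

pub struct RowData {
    pub priority: u32,
    pub name: String,
}

// Acquire each row's read lock once to extract just the sort key, then sort
// (key, index) pairs with a comparator that never touches the lock.
pub fn sorted_indices(rows: &[Row]) -> Vec<usize> {
    let mut keyed: Vec<(u32, usize)> = rows
        .iter()
        .enumerate()
        .map(|(i, r)| (r.inner.read().unwrap().priority, i))
        .collect();
    // n lock acquisitions above, zero here: the hot O(n log n) part is lock-free.
    keyed.sort_by_key(|&(k, _)| k);
    keyed.into_iter().map(|(_, i)| i).collect()
}

fn main() {
    let rows = vec![
        Row { inner: RwLock::new(RowData { priority: 2, name: "b".into() }) },
        Row { inner: RwLock::new(RowData { priority: 1, name: "a".into() }) },
    ];
    assert_eq!(sorted_indices(&rows), vec![1, 0]);
}
```

The lock is taken n times instead of O(n log n) times, and the comparator touches only a small, cache-friendly array of keys.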

I'm looking forward to the next post about how cache consistency is tough.

hinkley |root |parent [-]

By far my favorite feature of lodash is the sortBy function: instead of taking a custom comparator the way most standard libraries’ sort() does, it lets you supply a simple key for the sort to chew on. If your comparator needs to go beyond a couple of nested conditionals, into actual data transforms or grabbing locks, then that nasty little log n term in the runtime can take your sort from using 20% of the time budget for an expensive operation to taking 120%. Especially when you consider the memory bandwidth sloshing around L3 or the main memory bus to do the full table scan log n times.

I think the world would be better off if this weren’t in a third-party library in a single programming language.
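For what it's worth, Rust's standard library does ship this pattern natively. A small sketch of the difference (toy data, my own example, not from the thread):

```rust
fn main() {
    let mut words = vec!["pear", "fig", "banana", "kiwi"];

    // Comparator style: the key expression runs inside the comparator,
    // i.e. O(n log n) times.
    words.sort_by(|a, b| a.len().cmp(&b.len()));

    // Key style, the lodash sortBy equivalent: describe the key once and
    // let the sort do the rest.
    words.sort_by_key(|w| w.len());

    // For expensive keys, compute each key exactly once up front
    // (decorate-sort-undecorate); the comparison phase only touches the
    // precomputed keys.
    words.sort_by_cached_key(|w| w.to_lowercase());
    assert_eq!(words, vec!["banana", "fig", "kiwi", "pear"]);
}
```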

bob1029 |next |previous [-]

> So, yes, it takes time to read from memory. That's why we try to avoid allocations as much as possible.

This whole thing is summed up by some pretty basic physics. What you actually want to minimize is the communication of information between physical cores. Nothing else really matters. Certainly not terminology or clever tricks that effectively try to cheat thermodynamics. The cost of communicating information is almost always much more substantial than the cost of computing over that same information. The ALU is not nearly as power hungry as infinity fabric.
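A toy illustration of that cost model (my sketch, nothing from the post): both functions below do identical arithmetic, but one bounces a single cache line between cores on every increment while the other communicates between cores exactly once per thread:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Contended: every increment ping-pongs one cache line between cores.
fn count_shared(n_threads: usize, per_thread: u64) -> u64 {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..n_threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    counter.load(Ordering::Relaxed)
}

// Core-local: each thread accumulates privately; cores exchange data
// only once, when the partial sums are joined.
fn count_sharded(n_threads: usize, per_thread: u64) -> u64 {
    thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|_| {
                s.spawn(move || {
                    let mut local = 0u64;
                    for _ in 0..per_thread {
                        local += 1;
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    assert_eq!(count_shared(4, 1_000), 4_000);
    assert_eq!(count_sharded(4, 1_000), 4_000);
}
```

Both return the same count; any difference shows up only in wall-clock time, because the computation is identical and only the inter-core traffic differs.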

chuzz |root |parent |next [-]

Always wondered if it is possible/useful to model multi-core CPU caches as CISC cores, where the complex instructions are all the operations you can do within the L1 in the time frame of an L2 access.

mgaunard |root |parent |previous [-]

What the hell do allocations even have to do with time to read from memory...

It seems all of these Rust performance warriors have no understanding of computer architecture fundamentals.

hinkley |root |parent |next [-]

If an allocation falls in a forest and never gets written to, did it really happen?

We don’t allocate things we don’t write to. Maybe if the question sounds stupid, it’s because you didn’t appreciate it, not because other people are stupid.

menaerus |root |parent |previous [-]

It's not Rust per se. It's a human trait. People make uneducated assumptions about topics all of the time, and it's completely normal, since you have to bootstrap yourself somehow to be able to grow further. It's impossible to know everything. And it's also a difficult situation because you do have to arrive at some conclusion. I do think, though, that in this particular case the assumptions made are a little bit all over the place and the tone is overconfident.

pastescreenshot |previous [-]

The useful distinction here is not just AoS vs SoA, it is moving expensive work off the hot path. The biggest win in the article seems to be caching the sort/filter inputs so lock-taking and cache misses happen on updates, not during every comparison. That is a very transferable lesson even if you never go full data-oriented design.