Hacker News

An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

70 points by noamteyssier ago | 41 comments

zX41ZdbW |next [-]

This and similar tasks can be solved efficiently with clickhouse-local [1]. Example:

    ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt
I've tested it and it is faster than both sort and this Rust code:

    time LC_ALL=C sort data.txt | uniq -c | sort -rn > /dev/null
    32 sec.

    time hist data.txt > /dev/null
    14 sec.

    time ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt > /dev/null
    2.7 sec.
It is like a Swiss Army knife for data processing: it can solve various tasks, such as joining data from multiple files and data sources, processing various binary and text formats, converting between them, and accessing external databases.

[1] https://clickhouse.com/docs/operations/utilities/clickhouse-...

da_chicken |root |parent |next [-]

I'd not heard of clickhouse before. It does seem interesting, but I just can't get behind a project that says:

> The easiest way to download the latest version is with the following command:

> curl https://clickhouse.com/ | sh

Like, sure, there is some risk downloading a binary or running an arbitrary installer. But this is just nuts.

gigatexal |root |parent [-]

chDB is just a binary; you can just grab that. Also, piping to sh is used by a ton of projects

bflesch |root |parent [-]

It's used by many projects, but it's still regarded as an anti-pattern and a security issue

nsteel |root |parent |next |previous [-]

Just noting that in your benchmark (which we know nothing about), your "naive" data point is just 2.29x slower than hist. In their testing it was 27x slower! And it's not quite the same naive shell command, which isn't helpful.

nasretdinov |root |parent |next |previous [-]

To be more fair you could also add SETTINGS max_threads=1 though?

supermatt |root |parent [-]

How is that “more fair”?

nasretdinov |root |parent |next [-]

Well, fair in the sense that we'd be comparing which implementation is more efficient. Sure, ClickHouse is faster, but is it because it's actually using superior algorithms, or just because it executes things in parallel by default? I'd like to believe it's both, but without "user%" it's hard to tell.

mickeyp |root |parent |next [-]

Last time I checked, writing efficient, contention-free and correct parallel code is hard and often harder than pulling an algorithm out of a book.

reppap |root |parent |previous [-]

Would you take half the wheels off a car to compare it to a motorcycle?

gigatexal |root |parent |previous [-]

Exactly. I love this and DuckDB and other such amazing tools.

noamteyssier |next |previous [-]

Was sitting around in meetings today and remembered an old shell script I had for counting the number of unique lines in a file. Gave it a shot in Rust, and with a little bit of (over-engineering)™ I managed to get 25x the throughput of the naive coreutils approach, as well as improve over some existing tools.

Some notes on the improvements:

1. using csv (serde) for writing leads to some big gains

2. arena allocation of incoming keys + storing references in the hashmap instead of owned values heavily reduces the number of allocations and improves cache efficiency (I'm guessing, I did not measure); see the sketch below.
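
Not the actual hist code, but a minimal sketch of the technique in point 2, assuming the bumpalo crate as the arena and reading lines from stdin:

    // Sketch only, not the hist implementation. Assumes bumpalo = "3" in Cargo.toml.
    use bumpalo::Bump;
    use std::collections::HashMap;
    use std::io::{self, BufRead};

    fn main() -> io::Result<()> {
        let arena = Bump::new();                      // bump allocator holding all unique lines
        let mut counts: HashMap<&str, u64> = HashMap::new();

        // .lines() still allocates a fresh String per line; a real
        // implementation would reuse a read buffer.
        for line in io::stdin().lock().lines() {
            let line = line?;
            match counts.get_mut(line.as_str()) {
                Some(c) => *c += 1,
                None => {
                    // Copy the line into the arena once; the map stores only a
                    // reference, so each unique line costs a single allocation.
                    let key: &str = arena.alloc_str(&line);
                    counts.insert(key, 1);
                }
            }
        }

        let mut entries: Vec<(&str, u64)> = counts.into_iter().collect();
        entries.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // most frequent first
        for (line, count) in entries {
            println!("{count}\t{line}");
        }
        Ok(())
    }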

There are some regex functionalities and some table filtering built in as well.

happy hacking

theemptiness |root |parent |next [-]

Small semantics nit: it is not overengineered, it is engineered. You wanted more throughput; the collection of coreutils tools was designed for flexibility, not throughput.

It is not difficult to construct scenarios where throughput matters, but IMHO that does not determine engineering vs overengineering. What matters is whether there are requirements that need to be met. Debating the requirements is possible, but that doesn't take away from whether a solution obtained with reasonable effort meets the spec. Overengineering is about unreasonable effort, which could lead to overshooting the requirements, not about unreasonable requirements.

mabster |root |parent [-]

We had similar thoughts about "premature optimisation" in the games industry. That is, it's better to have prematurely optimised things than to find that "everything is slow". But I guess in that context there are many, many "inner-most loops" to optimise.

chii |root |parent [-]

> That is it's better to have prematurely optimised things than finding "everything is slow".

or you found that you've optimized a game that is unfun to play and thus doesn't sell, even tho it runs fast...

wongarsu |root |parent [-]

The best-practice solution would be to write a barely optimized ugly prototype to make sure the core idea is fun, then throw away the prototype and write the "real" game. But of course that's not always how reality works

dbdr |root |parent |previous [-]

> using csv (serde) for writing leads to some big gains

Could you explain that, if you have the time? Is that for writing the output lines? Is actual CSV functionality used? That crate says "Fast CSV parsing with support for serde", so I'm especially confused how that helps with writing.
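
(Not an answer from the author, but for context: serde-based writing with the csv crate looks roughly like the sketch below. The csv writer buffers internally, which is one guess at where the gain comes from; whether that matches what hist actually does isn't confirmed here.)

    // Hedged guess at what "using csv (serde) for writing" could look like;
    // not taken from the hist source. Assumes csv = "1" in Cargo.toml.
    use std::io;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mut wtr = csv::WriterBuilder::new()
            .delimiter(b'\t')                 // tab-separated output
            .from_writer(io::stdout().lock());

        // Each record is serialized straight into the writer's internal buffer.
        wtr.serialize(("some line", 42u64))?;
        wtr.serialize(("another line", 7u64))?;
        wtr.flush()?;
        Ok(())
    }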

donatj |next |previous [-]

I created "unic" a number of years ago because I needed to get the unique lines from a giant file without losing the order in which they initially appeared. It achieves this using a Cuckoo Filter, so it's pretty dang quick about it, faster than sorting a large file in memory for sure.

https://github.com/donatj/unic
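
The core order-preserving idea fits in a few lines of Rust; the sketch below uses an exact HashSet, whereas unic swaps that for a Cuckoo Filter to bound memory at the cost of being approximate (sketch only, not unic's actual code):

    use std::collections::HashSet;
    use std::io::{self, BufRead, Write};

    fn main() -> io::Result<()> {
        let mut out = io::BufWriter::new(io::stdout().lock());
        let mut seen = HashSet::new();

        for line in io::stdin().lock().lines() {
            let line = line?;
            // Print each line only the first time it appears, preserving
            // the original order; no sort needed.
            if !seen.contains(&line) {
                writeln!(out, "{line}")?;
                seen.insert(line);
            }
        }
        Ok(())
    }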

nasretdinov |next |previous [-]

Note that by default the sort command has pretty low memory usage and spills to disk. You can improve the throughput quite a bit by increasing the allowed memory usage: --buffer-size=SIZE

jll29 |next |previous [-]

I use questions around this pipeline in interviews. As soon as people say they'd write a Python program to sort a file, they get rejected.

Arguably, this will give a slower result in most cases, but the reason for the rejection is wasting developer time (not to mention time to test for correctness) re-developing something that is already available in the OS.

pbhjpbhj |root |parent |next [-]

I'm sure you're doing it in a sensible way, but... the thought that played out in my head went a little like this {apologies, I'm not well today, this might be a fever dream}:

Interviewer: "Welcome to your hammer-stuff interview, hope you're ready to show your hammering skills, we see from your resumé you've been hammering for a while now."

Schmuck: "Yeah, I just love to make my bosses rich by hammering things!"

Interviewer: "Great, let's get right into the hammer use ... here's a screw, show me how you'd hammer that."

Schmuck: (Thinks - "Well, of course, I wouldn't normally hammer those; but I know interviewers like to see weird things hammered! Here goes...")

[Hammering commences]

Interviewer: "Well thanks for flying in, but you've failed the interview. We were very impressed that you demonstrated some of the best hammering we've ever seen. But, of course, wanted to see you use a screwdriver here in your hammering interview at We Hammer All The Things."

wahern |root |parent |next |previous [-]

One of the cooler Unix command utilities is tsort, which performs a topological sort. Basically you give it a list of items (first word in each line) and their dependencies (subsequent words on each line) and it sorts them accordingly, similar to how, e.g., Make builds a graph of targets and dependencies to run recipes in the correct order. https://en.wikipedia.org/wiki/Tsort https://pubs.opengroup.org/onlinepubs/9799919799/utilities/t...

However, I've never found a use for it. Apparently it was written for the Version 7 Unix build system to sort libraries for passing to the linker, and it's still used. [1][2] But the few times I've needed a topological sort, it was part of a much larger problem where shell scripting was inappropriate, and implementing it from scratch using a typical sort routine isn't that difficult. Still, I'm waiting for an excuse to use it someday, hopefully in something high visibility so I can blow people's minds.

[1] https://github.com/openbsd/src/blob/17290de/share/mk/bsd.lib... [2] https://github.com/NetBSD/src/blob/7d8184e/share/mk/bsd.lib....
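
For anyone who hasn't written one, the from-scratch version really is short. A minimal Kahn's-algorithm sketch over tsort-style "left comes before right" pairs (illustrative only):

    use std::collections::{HashMap, HashSet, VecDeque};

    // Topological sort over dependency pairs, roughly what tsort does.
    // Returns None if the graph contains a cycle.
    fn toposort(pairs: &[(&str, &str)]) -> Option<Vec<String>> {
        let mut succs: HashMap<&str, HashSet<&str>> = HashMap::new();
        let mut indegree: HashMap<&str, usize> = HashMap::new();

        for &(before, after) in pairs {
            indegree.entry(before).or_insert(0);
            if succs.entry(before).or_default().insert(after) {
                *indegree.entry(after).or_insert(0) += 1;
            }
        }

        // Start from nodes with no prerequisites and peel them off layer by layer.
        let mut queue: VecDeque<&str> =
            indegree.iter().filter(|&(_, &d)| d == 0).map(|(&n, _)| n).collect();
        let mut order = Vec::new();

        while let Some(node) = queue.pop_front() {
            order.push(node.to_string());
            for &s in succs.get(node).into_iter().flatten() {
                let d = indegree.get_mut(s).unwrap();
                *d -= 1;
                if *d == 0 {
                    queue.push_back(s);
                }
            }
        }

        // If a cycle exists, some nodes never reach indegree 0.
        (order.len() == indegree.len()).then_some(order)
    }

    fn main() {
        // "a before b", "b before c"  =>  a, b, c
        assert_eq!(toposort(&[("a", "b"), ("b", "c")]).unwrap(), ["a", "b", "c"]);
    }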

f311a |root |parent |next |previous [-]

This depends on the context... If a file is pretty small, I would avoid sort pipes when there is a Python codebase. It's only useful when the files are pretty big (1-5GB+).

They are tricky and not very portable. Sorting depends on locales and the GNU tools implementation.

coldstartops |root |parent |previous [-]

> Wasting developer time

What is the definition of wasting developer time? If a developer takes a 2-hour break to recover mentally and avoid burnout, is that considered time wasted?

Someone |next |previous [-]

> I am measuring the performance of equivalent cat <file> | sort | uniq -c | sort -n functionality.

It likely won’t matter much here, but invoking cat is unnecessary.

   sort <file> | uniq -c | sort -n
will do the job just fine. GNU’s sort also has a few flags controlling buffer size and parallelism. Those may matter more (see https://www.gnu.org/software/coreutils/manual/html_node/sort...)

majke |next |previous [-]

nasretdinov |root |parent [-]

I believe, given its reliance on the Bloom filter, that it doesn't actually report occurrence counts?

noctune |next |previous [-]

I built something similar a few years ago for `sort | uniq -d` using sketches. The downside is that you need two passes, but it's still faster overall than sorting: https://github.com/mpdn/sketch-duplicates

f311a |next |previous [-]

People often use sort | uniq when they don't want to load a bunch of lines into memory. That's why it's slow. It uses files and allocates very little memory by default. The pros? You can sort hundreds of gigabytes of data.

This Rust implementation uses a hashmap; if you have a lot of unique values, you will need a lot of RAM.

ukuina |next |previous [-]

Neat!

Are there any tools that tolerate slight mismatches across lines while combining them (e.g., a timestamp, or only one text word changing)?

I attempted this with a vector DB, but the embeddings calculation for millions of lines is prohibitive, especially on CPU.

fsiefken |next |previous [-]

I'm curious how much faster this is compared to the Rust uutils coreutils ports of sort and uniq.

scaredginger |next |previous [-]

Looks like the impl uses a HashMap. I'd be curious about how a trie or some other specialized string data structure would compare here.
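
For concreteness, a byte-level trie counter might look like the sketch below (not from hist; just to make the comparison tangible). Whether it beats a flat hashmap would depend heavily on key lengths and shared prefixes:

    use std::collections::HashMap;

    #[derive(Default)]
    struct Trie {
        children: HashMap<u8, Trie>,
        count: u64, // number of lines ending exactly at this node
    }

    impl Trie {
        fn add(&mut self, line: &str) {
            let mut node = self;
            for &b in line.as_bytes() {
                node = node.children.entry(b).or_default();
            }
            node.count += 1;
        }

        // Depth-first walk collecting (line, count) pairs.
        fn dump(&self, prefix: &mut Vec<u8>, out: &mut Vec<(String, u64)>) {
            if self.count > 0 {
                out.push((String::from_utf8_lossy(prefix).into_owned(), self.count));
            }
            for (&b, child) in &self.children {
                prefix.push(b);
                child.dump(prefix, out);
                prefix.pop();
            }
        }
    }

    fn main() {
        let mut trie = Trie::default();
        for line in ["c", "a", "b", "c"] {
            trie.add(line);
        }
        let mut out = Vec::new();
        trie.dump(&mut Vec::new(), &mut out);
        out.sort_by(|a, b| b.1.cmp(&a.1)); // most frequent first
        println!("{out:?}"); // e.g. [("c", 2), ("a", 1), ("b", 1)]
    }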

vlovich123 |next |previous [-]

Why does this test against sort | uniq | sort? It's kind of weird to sort twice, no?

gucci-on-fleek |root |parent |next [-]

The first "sort" sorts the input lines lexicographically (which is required for "uniq" to work); the second "sort" sorts the output of "uniq" numerically (so that lines are ordered from most-frequent to least-frequent):

  $ echo c a b c | tr ' ' '\n'
  c
  a
  b
  c
  
  $ echo c a b c | tr ' ' '\n' | sort
  a
  b
  c
  c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c
        1 a
        1 b
        2 c
  
  $ echo c a b c | tr ' ' '\n' | sort | uniq -c | sort -rn
        2 c
        1 b
        1 a

Aaron2222 |root |parent |next |previous [-]

  sort | uniq -c | sort -n
The second sort is sorting by frequency (the count output by `uniq -c`).

BuildTheRobots |root |parent |previous [-]

It's something I've done myself in the past. The first sort is because the input needs to be sorted for uniq -c to count it properly; the second sort is because uniq doesn't give the output in the order you want (by count).

evertedsphere |root |parent [-]

More precisely, uniq produces output in the same order as its input, just collapsing runs (run-length encoding them).

flowerthoughts |previous [-]

The win here might be using a HashMap to avoid having to sort all entries, and only sorting the aggregated counts at the end. What's the ratio of duplicates in the benchmark input?

There is no text encoding processing, so this only works for single byte encodings. That probably speeds it up a little bit.

Depending on the size of the benchmark input, sort(1) may have done disk-based sorting. What's the size of the benchmark input?

wodenokoto |root |parent [-]

To me, the really big win would be _not_ having to sort at all. An option to keep the first or last duplicate and remove all others while keeping line order is usually what I need.

mabster |root |parent |next [-]

I've written this kind of function so many times it's not funny. I usually want something that is fed from an iterator, removes duplicates, and yields values as soon as possible.
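
That adapter is small enough to carry around; a generic sketch (illustrative only) that keeps the first occurrence and yields values lazily:

    use std::collections::HashSet;
    use std::hash::Hash;

    // Yield each item only the first time it is seen, preserving input order
    // and emitting items as soon as they arrive.
    fn dedup_first<I>(iter: I) -> impl Iterator<Item = I::Item>
    where
        I: Iterator,
        I::Item: Eq + Hash + Clone,
    {
        let mut seen = HashSet::new();
        iter.filter(move |item| seen.insert(item.clone()))
    }

    fn main() {
        let unique: Vec<_> = dedup_first(["c", "a", "b", "c", "a"].into_iter()).collect();
        assert_eq!(unique, ["c", "a", "b"]);
    }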

thaumasiotes |root |parent |previous [-]

That's easy to do if you're keeping the first duplicate. It becomes complex if you're keeping the last duplicate, because every time you find a duplicate you have to go back through your "output" and delete the earlier occurrence.

You could do an annotating pass to learn which occurrence of each line is the last one, and then a follow-up pass that prints (or otherwise echoes) only the lines that are the last of their kind. Technically still faster than sorting.

You could also keep the information on last occurrence of each line in the hash map (that's where it's going to be anyway), and once you're done with the first pass sort the map by earliest last occurrence. That will get you the lines in the right order, but you had to do a sort. If the original input was mostly duplicates, this is probably a better approach.

You could also track last occurrence of each line in a separate self-sorting structure. Now you have slightly more overhead while processing the input, and sorting the output is free.
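
A sketch of the hash-map-of-last-indices approach described above (record the last position of every line, then sort only the unique lines by that position); illustrative only:

    use std::collections::HashMap;

    // Keep only the last occurrence of each line, preserving the relative
    // order of those last occurrences. One pass to record positions, then a
    // sort over unique lines only.
    fn keep_last(lines: &[&str]) -> Vec<String> {
        let mut last_seen: HashMap<&str, usize> = HashMap::new();
        for (i, &line) in lines.iter().enumerate() {
            last_seen.insert(line, i); // later occurrences overwrite earlier ones
        }
        let mut entries: Vec<(&str, usize)> = last_seen.into_iter().collect();
        entries.sort_unstable_by_key(|&(_, i)| i);
        entries.into_iter().map(|(line, _)| line.to_string()).collect()
    }

    fn main() {
        assert_eq!(keep_last(&["c", "a", "b", "c", "a"]), ["b", "c", "a"]);
    }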