An overengineered solution to `sort | uniq -c` with 25x throughput (hist)
zX41ZdbW
    ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt
I've tested it, and it is faster than both sort and this Rust code:

    time LC_ALL=C sort data.txt | uniq -c | sort -rn > /dev/null
    32 sec.

    time hist data.txt > /dev/null
    14 sec.

    time ch --input-format LineAsString --query "SELECT line, count() AS c GROUP BY line ORDER BY c DESC" < data.txt > /dev/null
    2.7 sec.
It is like a Swiss Army knife for data processing: it can solve various tasks, such as joining data from multiple files and data sources, processing various binary and text formats, converting between them, and accessing external databases. [1]

[1] https://clickhouse.com/docs/operations/utilities/clickhouse-...
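For instance, joining two local files can be done directly against the file() table function (file names and columns here are invented for illustration):

    ch --query "SELECT a.id, b.name FROM file('a.csv') AS a JOIN file('b.csv') AS b ON a.id = b.id"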
da_chicken
> The easiest way to download the latest version is with the following command:
> curl https://clickhouse.com/ | sh
Like, sure, there is some risk downloading a binary or running an arbitrary installer. But this is just nuts.
noamteyssier
Some notes on the improvements:
1. Using csv (serde) for writing the output leads to some big gains.
2. Arena allocation of incoming keys, storing references in the hashmap instead of owned values, heavily reduces the number of allocations and improves cache efficiency (I'm guessing; I did not measure). A rough sketch of the idea is below.
There is some regex functionality and some table filtering built in as well.
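The arena idea in point 2, as a minimal illustration (assuming the bumpalo crate for the arena; this is not the exact hist code):

    use std::collections::HashMap;
    use std::io::{self, BufRead};

    use bumpalo::Bump;

    fn main() -> io::Result<()> {
        let arena = Bump::new(); // every distinct line is copied in here exactly once
        let mut counts: HashMap<&str, u64> = HashMap::new();

        for line in io::stdin().lock().lines() {
            let line = line?;
            if let Some(c) = counts.get_mut(line.as_str()) {
                *c += 1; // repeated line: no allocation at all
            } else {
                // first occurrence: copy the bytes into the arena and keep
                // only a &str reference as the map key
                counts.insert(arena.alloc_str(&line), 1);
            }
        }

        let mut pairs: Vec<(&str, u64)> = counts.into_iter().collect();
        pairs.sort_unstable_by(|a, b| b.1.cmp(&a.1));
        for (line, count) in pairs {
            println!("{count} {line}");
        }
        Ok(())
    }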
happy hacking
theemptiness
It is not difficult to construct scenarios where throughput matters, but IMHO that does not determine engineering vs. overengineering. What matters is whether there are requirements that need to be met. Debating the requirements is possible, but that doesn't take away from whether a solution obtained with reasonable effort meets the spec. Overengineering is about unreasonable effort, which can overshoot the requirements, not about unreasonable requirements.
dbdr
Could you explain that, if you have the time? Is that for writing the output lines? Is actual CSV functionality used? That crate says "Fast CSV parsing with support for serde", so I'm especially confused how that helps with writing.
jll29
Arguably, this will be slower in most cases, but the reason for the rejection is wasting developer time (not to mention time to test for correctness) re-developing something that is already available in the OS.
pbhjpbhj
Interviewer: "Welcome to your hammer-stuff interview, hope you're ready to show your hammering skills, we see from your resumé you've been hammering for a while now."
Schmuck: "Yeah, I just love to make my bosses rich by hammering things!"
Interviewer: "Great, let's get right into the hammer use ... here's a screw, show me how you'd hammer that."
Schmuck: (Thinks - "Well, of course, I wouldn't normally hammer those; but I know interviewers like to see weird things hammered! Here goes...")
[Hammering commences]
Interviewer: "Well thanks for flying in, but you've failed the interview. We were very impressed that you demonstrated some of the best hammering we've ever seen. But, of course, wanted to see you use a screwdriver here in your hammering interview at We Hammer All The Things."
wahern
However, I've never found a use for tsort. Apparently it was written for the Version 7 Unix build system to sort libraries for passing to the linker, and it is still used for that.[1][2] But the few times I've needed a topological sort, it was part of a much larger problem where shell scripting was inappropriate, and implementing it from scratch using a typical sort routine isn't that difficult. Still, I'm waiting for an excuse to use it someday, hopefully in something high-visibility so I can blow people's minds.
[1] https://github.com/openbsd/src/blob/17290de/share/mk/bsd.lib... [2] https://github.com/NetBSD/src/blob/7d8184e/share/mk/bsd.lib....
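For anyone who hasn't met it: tsort reads whitespace-separated pairs, where each pair means "the left name must come before the right one", and prints one valid ordering (names here invented):

    $ printf 'libc libm\nlibm app\nlibc app\n' | tsort
    libc
    libm
    app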
f311a
They are tricky and not very portable. Sorting depends on the locale and on the GNU tools implementation.
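For example, the default locale collation and the C locale can disagree on something as simple as case:

    $ printf 'B\na\n' | sort            # e.g. under en_US.UTF-8
    a
    B
    $ printf 'B\na\n' | LC_ALL=C sort   # plain byte order
    B
    a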
Someone
It likely won’t matter much here, but invoking cat is unnecessary;

    sort <file> | uniq -c | sort -n

will do the job just fine. GNU’s sort also has a few flags controlling buffer size and parallelism; those may matter more (see https://www.gnu.org/software/coreutils/manual/html_node/sort...).
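For example (buffer size and thread count here are illustrative, not recommendations):

    sort -S 1G --parallel=4 data.txt | uniq -c | sort -rn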
f311a
This Rust implementation uses a hashmap, so if you have a lot of unique values, you will need a lot of RAM.
ukuina
Are there any tools that tolerate slight mismatches across lines while combining them (e.g., a timestamp, or only one text word changing)?
I attempted this with a vector DB, but the embeddings calculation for millions of lines is prohibitive, especially on CPU.
gucci-on-fleek
    $ echo c a b c | tr ' ' '\n'
    c
    a
    b
    c

    $ echo c a b c | tr ' ' '\n' | sort
    a
    b
    c
    c

    $ echo c a b c | tr ' ' '\n' | sort | uniq -c
    1 a
    1 b
    2 c

    $ echo c a b c | tr ' ' '\n' | sort | uniq -c | sort -rn
    2 c
    1 b
    1 a
Aaron2222
    sort | uniq -c | sort -n

The second sort is sorting by frequency (the count output by `uniq -c`).
flowerthoughts
There is no text-encoding processing, so this only works for single-byte encodings. That probably speeds it up a little bit.
Depending on its size, sort(1) may have done disk-based sorting. What was the size of the benchmark input?
thaumasiotes
You could do an annotating pass to learn which occurrence of each line is the last one, and then a follow-up pass that prints (or otherwise echoes) only the lines that are the last of their kind. Technically still faster than sorting.
You could also keep the last occurrence of each line in the hash map (that's where it's going to be anyway), and once you're done with the first pass, sort the map by earliest last occurrence. That gets you the lines in the right order, but you had to do a sort. If the original input is mostly duplicates, this is probably the better approach; a rough sketch is below.
You could also track last occurrence of each line in a separate self-sorting structure. Now you have slightly more overhead while processing the input, and sorting the output is free.
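A minimal sketch of that hash-map variant (hypothetical code, not hist's): keep (count, last index) per line, then do a single sort at the end.

    use std::collections::HashMap;
    use std::io::{self, BufRead};

    fn main() -> io::Result<()> {
        // line -> (count, index of its last occurrence)
        let mut seen: HashMap<String, (u64, usize)> = HashMap::new();

        for (i, line) in io::stdin().lock().lines().enumerate() {
            let line = line?;
            let entry = seen.entry(line).or_insert((0, 0));
            entry.0 += 1; // running count
            entry.1 = i;  // most recent position so far
        }

        // one sort at the end orders the lines by where they last appeared
        let mut entries: Vec<(String, (u64, usize))> = seen.into_iter().collect();
        entries.sort_unstable_by_key(|e| (e.1).1);
        for (line, (count, _)) in entries {
            println!("{count} {line}");
        }
        Ok(())
    }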