Hacker News
Apache Arrow is 10 years old
data_ders
Yet today I feel it was 2016 dataders who was the crazy one lol
ayhanfuat
jtbaker
Really, I prefer DuckDB SQL these days for anything that needs to perform well, and I feel like SQL is easier to grok than Python code most of the time.
0xcafefood
postexitus
ayhanfuat
mistrial9
aynyc
tosh
Feather is optimized for fast reading.
dionian
aynyc
HoldOnAMinute
pm90
aerzen
mempko
kccqzy
thinkharderdev
actionfromafar
It's very neat for some types of data to have columns contiguous in memory.
skeeter2020
That's not really the purpose; it's really a language-independent format, so that you don't need to change it for, say, a dataframe or R. It's columnar because for analytics (where you do lots of aggregations and filtering) this is far more performant; the data is intentionally stored so the target columns are contiguous. You probably already know this, but the analytics equivalent of SQLite is DuckDB. Arrow can also eliminate the need to serialize and deserialize data when sharing it (e.g., in a high-performance data pipeline), because different consumers, tools, and operations can use the same memory representation as-is.
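To make the columnar point concrete, here is a minimal sketch using only Python's standard library. It is not the Arrow API; the `array` module just stands in for "one contiguous typed buffer per column", and the sample records and names (`rows`, `ids`, `prices`) are made up for illustration.

```python
from array import array

# Row-oriented: one tuple per record (hypothetical sample data).
rows = [(1, 10.0), (2, 20.5), (3, 30.25)]

# Column-oriented: each column lives in its own contiguous typed buffer.
ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
prices = array("d", (r[1] for r in rows))  # 64-bit floats

# An aggregation over one column only touches that column's buffer.
total = sum(prices)

# A filter yields row positions that can then be applied to any other column.
selected = [i for i, p in enumerate(prices) if p > 15.0]
selected_ids = [ids[i] for i in selected]

print(total, selected_ids)
```

With the row layout, the same sum would have to walk every record and skip over the fields it does not need.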
mandeepj
Not sure if I misunderstood, what are the chances those different consumers / tools / operations are running in your memory space?
daddykotex
You still have to transfer the data, but you remove the need for a transformation before writing to the wire, and a transformation when reading from the wire.
cestith
The key phrase, though, would seem to be “memory representation” and not “same memory”. You can spit the in-memory representation out to an Arrow file or an Arrow stream, take it in, and it’s in the same memory layout in the other program. That’s kind of the point of Arrow. It’s a standard memory layout available across applications and even across languages, which can be really convenient.
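A rough stdlib-only sketch of that idea follows. It is an illustration under stated assumptions, not Arrow's actual file or stream format: the producer and consumer are assumed to agree on the element type (float64) and to share the same native byte order, and `wire_bytes` is a hypothetical stand-in for whatever file, socket, or pipe carries the data.

```python
from array import array

# Producer side: a float64 column held in one contiguous buffer.
prices = array("d", [10.0, 20.5, 30.25])

# "Writing to the wire": the buffer contents are copied out as-is.
# There is no per-value encoding step.
wire_bytes = prices.tobytes()

# Consumer side: reinterpret the received bytes as float64 values.
# cast() does not parse or convert anything; it views the same bytes
# with a different element type.
received = memoryview(wire_bytes).cast("d")

print(received.tolist())
```

The bytes still have to travel, but neither side runs a transformation step, because both already agree on the layout.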
nu11ptr
tosh
You can also store Arrow on disk, but it is mainly used as an in-memory representation.
data_ders
It's actually many things: an IPC protocol, a wire protocol, a database connectivity spec, etc.
In reality it's about an in-memory tabular (columnar) representation that enables zero-copy operations between languages and engines.
And, imho, it all really comes down to standard data types for columns!
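As a small illustration of "zero copy plus a standard type tag", here is a sketch using Python's buffer protocol. This is only an analogy within a single process, not Arrow itself: `memoryview` plays the role of a second consumer that is handed the same buffer along with a description of its element type.

```python
from array import array

# A column of C ints in one contiguous buffer.
column = array("i", [1, 2, 3, 4])

# A second "consumer" gets a view of the same memory, not a copy.
view = memoryview(column)

# Slicing the view does not copy either; it points into the same buffer.
tail = view[2:]

# A write through the original is visible through the views,
# which shows that no data was duplicated.
column[3] = 40

# The view carries a type tag and item size, so the consumer knows
# how to interpret the bytes without any out-of-band agreement.
print(tail.tolist(), view.format, view.itemsize)
```

The type tag is what makes the sharing safe: without an agreed description of what each column holds, the consumer would have to guess or convert.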