Hacker News
Show HN: Gemini can now natively embed video, so I built sub-second video search
I used this to build a CLI that indexes hours of footage into ChromaDB, then searches it with natural language and auto-trims the matching clip. There's a demo video in the GitHub README. Indexing costs ~$2.50 per hour of footage. Still-frame detection skips idle chunks, so security-camera and Sentry Mode footage is much cheaper to index.
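A minimal sketch of the index-and-search loop described above, assuming each chunk has already been reduced to a few sampled grayscale frames plus an embedding vector. This is not the project's code: the Gemini embedding call and ChromaDB are replaced by toy vectors and a linear scan, and `Chunk`, `is_still_chunk`, `search`, and the 0.02 still-frame threshold are all hypothetical.

```python
import math
from dataclasses import dataclass

# Hypothetical threshold: mean absolute pixel change (0..1 scale) below which
# a chunk is treated as idle and skipped before embedding.
STILL_THRESHOLD = 0.02


@dataclass
class Chunk:
    start_s: float   # chunk start time in seconds
    end_s: float     # chunk end time in seconds
    frames: list     # sampled grayscale frames, each a flat list of floats in [0, 1]


def mean_frame_diff(a, b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def is_still_chunk(chunk, threshold=STILL_THRESHOLD):
    """A chunk is 'still' if no consecutive pair of sampled frames differs much."""
    diffs = [mean_frame_diff(f0, f1) for f0, f1 in zip(chunk.frames, chunk.frames[1:])]
    return max(diffs, default=0.0) < threshold


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def search(index, query_vec, top_k=1):
    """Return (start_s, end_s) of the top_k chunks most similar to the query.

    A vector store such as ChromaDB would use a nearest-neighbour index here;
    a linear scan keeps the sketch self-contained.
    """
    scored = sorted(index, key=lambda item: cosine(item[1], query_vec), reverse=True)
    return [(chunk.start_s, chunk.end_s) for chunk, _ in scored[:top_k]]


chunks = [
    Chunk(0.0, 30.0, [[0.5] * 4, [0.5] * 4]),    # identical frames: idle, skipped
    Chunk(30.0, 60.0, [[0.1] * 4, [0.9] * 4]),   # motion
    Chunk(60.0, 90.0, [[0.2] * 4, [0.6] * 4]),   # motion
]
# Toy stand-ins for the per-chunk embeddings a real model would return.
fake_embeddings = {30.0: [1.0, 0.0], 60.0: [0.0, 1.0]}
index = [(c, fake_embeddings[c.start_s]) for c in chunks if not is_still_chunk(c)]
print(search(index, query_vec=[0.9, 0.1]))  # → [(30.0, 60.0)]
```

The returned time range is what a trim step would consume; for example, it could be passed to ffmpeg's `-ss`/`-to` options, though how the actual CLI does its trimming isn't stated in the post.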
macNchz
To me at least, cameras everywhere become considerably more concerning than the status quo once there is an AI watching and indexing every second of every feed, where camera owners, manufacturers, or governments could set simple natural-language parameters for highly specific people or activities to be notified about. There are obviously compelling, easy-to-sell use cases here that will surely drive adoption as this becomes cost-effective: get an alert about a crime in progress, about a neighbor who doesn't clean up after his dog, or when someone has fallen. But the potential implications of living in a panopticon like this, if it isn't well regulated, are pretty ugly.
citruscomputing
[0]: https://www.axon.com/products/axon-fusus
[1]: https://citizen.com/
Ajedi32
The problems start cropping up when you get things like Flock where governments start deploying cameras on a massive scale, or Ring where a single company has unrestricted access to everyone's private cameras.
cake_robot
sohamrj
cloogshicer
Imagine a Premiere plugin where you could say "remove all scenes containing cats" and it'll spit out an EDL (Edit Decision List) that you can still manually adjust.
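A rough sketch of the last step such a plugin might perform: given the time ranges a video search flagged as containing cats (made-up values here), compute the ranges to keep and print them as simplified CMX 3600-style cut events. The function names, the 25 fps frame rate, and the exact column layout are assumptions; a real exporter would need to match the EDL dialect the editing software expects.

```python
FPS = 25  # assumed frame rate for timecode conversion


def to_timecode(seconds, fps=FPS):
    """Convert seconds to a non-drop-frame HH:MM:SS:FF timecode."""
    total_frames = round(seconds * fps)
    frames = total_frames % fps
    secs = total_frames // fps
    return f"{secs // 3600:02d}:{secs // 60 % 60:02d}:{secs % 60:02d}:{frames:02d}"


def keep_ranges(duration, removed):
    """Return the complement of the removed (start, end) ranges within [0, duration]."""
    keep, cursor = [], 0.0
    for start, end in sorted(removed):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)  # max() also merges overlapping removals
    if cursor < duration:
        keep.append((cursor, duration))
    return keep


def edl_events(keep):
    """Yield simplified cut events, laying the kept clips end to end on the record side."""
    record = 0.0
    for n, (start, end) in enumerate(keep, 1):
        length = end - start
        yield (f"{n:03d}  AX       V     C        "
               f"{to_timecode(start)} {to_timecode(end)} "
               f"{to_timecode(record)} {to_timecode(record + length)}")
        record += length


cat_scenes = [(12.0, 20.2), (45.0, 50.0)]  # hypothetical search hits, in seconds
for line in edl_events(keep_ranges(60.0, cat_scenes)):
    print(line)
```

Because the output is plain text, the editor could still open it and adjust each cut by hand.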
danbrooks
simonreiff
emsign
nclin_
RobotToaster
draw_down
BrokenCogs
kamranjon
sohamrj
Would love to see open-weight models with this capability since it would eliminate the API cost and the privacy concern of uploading footage.
ygouzerh
dev_tools_lab
SpaceManNabs
If there is text in the video (like a caption or whatever), will the embedding capture that? I'd never thought about this before.
If the video has audio, does the embedding capture that too?
sohamrj
7777777phil
Cool project, thanks for sharing!
Aeroi
sohamrj
It's a bit expensive right now, so it's not as practical at scale. But once the embedding model comes out of public preview, and hopefully we get a local equivalent, this will be a lot more practical.
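To put "not as practical at scale" into numbers, here is a back-of-the-envelope estimate built on the ~$2.50 per hour of footage figure from the original post. The camera counts and idle fractions are made-up inputs, and the sketch assumes cost scales linearly with footage and that chunks skipped by still-frame detection cost nothing.

```python
COST_PER_HOUR = 2.50  # indexing cost per hour of footage, as quoted in the post


def daily_cost(cameras, hours_per_day=24.0, idle_fraction=0.0, cost_per_hour=COST_PER_HOUR):
    """Estimated indexing cost per day.

    Assumes linear pricing and that the idle_fraction of footage skipped by
    still-frame detection is free.
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    billable_hours = cameras * hours_per_day * (1.0 - idle_fraction)
    return billable_hours * cost_per_hour


print(daily_cost(1))                        # → 60.0  (one camera, nothing skipped)
print(daily_cost(10, idle_fraction=0.75))   # → 150.0 (ten mostly idle cameras)
```

Even with aggressive skipping, continuously indexing many feeds adds up quickly, which is why a cheaper or local model would matter.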