sohamrj's comments | Hacker News

Last week, I posted SentrySearch, a CLI for semantic video search using Gemini's embedding API. The #1 request was local model support.

Turns out Qwen3-VL-Embedding can natively embed video into the same kind of vector space, no API, fully offline. Runs on Apple Silicon (MPS) and NVIDIA GPUs (CUDA). The 8B model needs ~18GB RAM, or use the 2B model on smaller machines.

sentrysearch index /path --backend local

Also added: similarity threshold to suppress weak matches, and a Tesla metadata overlay that renders speed/location onto matched clips.
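As a rough illustration of the similarity-threshold feature, weak matches can be suppressed by dropping chunks whose cosine similarity to the query embedding falls below a cutoff. This is a minimal sketch, not SentrySearch's actual API; the function name and the 0.5 default are hypothetical:

```python
import numpy as np

def filter_matches(query_vec, chunk_vecs, threshold=0.5):
    """Return (chunk_index, score) pairs whose cosine similarity to the
    query embedding meets the threshold. Hypothetical helper: the name
    and default threshold are illustrative, not SentrySearch's values."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of every chunk against the query
    return [(i, float(s)) for i, s in enumerate(scores) if s >= threshold]
```

Raising the threshold trades recall for precision: a clip that merely resembles the query ("car driving past") drops out, while a strong match survives.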

Details on the README.


You're not wrong


Haha, no. I'm driving the Tesla, and that clip is from the left repeater camera (Teslas record from all around the car).


Yeah, it's so events on a chunk boundary still get captured in at least one chunk. I haven't had the chance to run formal benchmarks on overlap vs. no-overlap yet. The 5s default is a pragmatic choice: long enough to catch most events that would otherwise be split, short enough not to add much cost (120 chunks/hr to ~138). It's also configurable via the --overlap flag.
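A minimal sketch of overlapped chunking, assuming 30s chunks (which matches the 120 chunks/hr no-overlap baseline). With a 5s overlap this particular scheme yields 144 spans per hour, in the same ballpark as the ~138 figure above; the exact count depends on chunk length and edge handling:

```python
def chunk_spans(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) spans covering the video, each overlapping the
    next by overlap_s so a boundary event lands fully inside at least one
    chunk. The 30s chunk length is an assumption, not a confirmed value."""
    step = chunk_s - overlap_s
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        start += step
    return spans
```

With overlap_s=0 the spans tile the hour exactly (120 chunks); with overlap_s=5 the step shrinks to 25s, so an event at t=29s appears whole in the second chunk even though it straddles the first chunk's boundary.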


Dashcam is just one of the use cases, and the one I tested on, but this could theoretically work with any kind of video footage, like home security footage.


Yeah, this is a great idea, I’ve actually been thinking about exactly this as the next logical step.

SentrySearch already returns precise in/out timestamps for any natural-language query and uses ffmpeg to auto-trim clips. Turning that into an EDL (or even a direct Premiere plugin that exports an editable cut list) feels natural.

I’m not a Premiere expert myself, but I’d love to see this happen. If you (or anyone) wants to sketch out a quick EDL exporter or plugin, I’ll happily review + merge a PR and help wherever I can. Just drop a GitHub issue if you start something!
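For anyone who wants to pick this up, here's a rough sketch of what a CMX3600-style EDL exporter could look like, taking the (in, out) second offsets the search already produces. This is a hedged sketch, not part of SentrySearch: the AX reel name and 30 fps non-drop timecode are placeholder assumptions.

```python
def to_timecode(seconds, fps=30):
    """Convert seconds to a non-drop HH:MM:SS:FF timecode string."""
    total = int(round(seconds * fps))
    f = total % fps
    s = (total // fps) % 60
    m = (total // (fps * 60)) % 60
    h = total // (fps * 3600)
    return f"{h:02d}:{m:02d}:{s:02d}:{f:02d}"

def export_edl(matches, title="SENTRYSEARCH CUT", fps=30):
    """Build a minimal CMX3600-style EDL from (source_in, source_out)
    second offsets. Record timecodes are laid back-to-back from 00:00,
    so the events form a simple sequential cut."""
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    rec = 0.0
    for i, (src_in, src_out) in enumerate(matches, start=1):
        dur = src_out - src_in
        lines.append(
            f"{i:03d}  AX       V     C        "
            f"{to_timecode(src_in, fps)} {to_timecode(src_out, fps)} "
            f"{to_timecode(rec, fps)} {to_timecode(rec + dur, fps)}"
        )
        rec += dur
    return "\n".join(lines)
```

Premiere (and most NLEs) can import an EDL like this as a sequence, which would turn a batch of search hits into an editable cut list.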


I've found I have to be very specific to get the clip I'm searching for. For example, "car cuts me off" just returned a clip of a car driving past my blind spot. But "car with bike rack on back cuts me off at night" gave me exactly the clip I was looking for.


Thanks! Yeah, that would be pretty cool, but continuous indexing would be pretty expensive right now, because the model's in public preview and there are no local alternatives AFAIK.

This very well might be a reality in a couple years though!


Very cool stuff, gave me the inspiration to try it locally. Works fairly well I think: https://github.com/jakejimenez/sentinelsearch


Could https://qwen.ai/blog?id=qwen3-vl-embedding be a possible local alternative?


Totally valid concern. Right now the cost ($2.50/hr) and latency make continuous real-time indexing impractical, but that won't always be the case. This is one of the reasons I'd want to see open-weight local models for this: it keeps the indexing on your own hardware, with no footage leaving your machine. But you're right that the broader trajectory here is worth thinking carefully about.


It's $2.50 an hour because Google has margins. A nation state could do it at cost, and even if that's not a huge difference, a year's worth of embeddings for one continuous feed is just $21,900. That's a rounding error, especially considering it's a one-time cost for the footage.


Right? $2.50 an hour is trivial to a government that can vote to invent a trillion dollars. Even just $1 million covers monitoring 45 real-time feeds for a year. I'm sure many very rich people would pay that for the safety of their compound.
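The arithmetic in the two comments above checks out at the quoted $2.50/hr rate (counting only full feeds a fixed budget can cover):

```python
# Back-of-envelope surveillance cost at the quoted $2.50/hr rate.
hourly = 2.50
per_feed_year = hourly * 24 * 365              # one continuous feed, one year
full_feeds_per_million = int(1_000_000 // per_feed_year)
print(per_feed_year, full_feeds_per_million)   # 21900.0 45
```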


How are you getting to $2.50/hr? The price sheet says it's $0.00079 per frame.

https://ai.google.dev/gemini-api/docs/pricing#gemini-embeddi...


From what I see the code downsamples video to 5 fps, so 1 hour of video is 3600 seconds * 5 fps = 18,000 frames. 18,000 frames * $0.00079/frame = $14.22. A couple dollars more with the overlap.

(The code also tries to skip "still" frames, but if your video is dynamic you're looking at the cost above.)


You're right that the code uses ffmpeg to downsample the chunks to 5 fps before sending them, but that's only a local/bandwidth optimization, not what the API actually processes.

Regardless of the file's frame rate, the Gemini API natively extracts and tokenizes exactly 1 fps. The 5 fps downsampling just keeps payload sizes small so the API requests are fast and don't time out.

I'll update the README to make this clearer. Thanks for bringing it up.
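If the API really tokenizes at 1 fps, the per-hour math lands close to the $2.50/hr figure cited upthread (the remaining gap would presumably come from skipped still frames and rounding):

```python
# Cost per hour of video if the API samples 1 fps, at the quoted
# $0.00079/frame price, regardless of the uploaded file's frame rate.
per_frame = 0.00079
frames_per_hour = 3600 * 1          # 1 fps native sampling
cost = frames_per_hour * per_frame
print(round(cost, 2))               # 2.84
```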


Thanks for the details and correction.


Not aware of any that do native video-to-vector embedding the way Gemini Embedding 2 does. There are CLIP-based models (like VideoCLIP) that embed frames individually, but they don't model temporal information; you'd need to average the frame embeddings, which loses a lot.

Would love to see open-weight models with this capability since it would eliminate the API cost and the privacy concern of uploading footage.
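To make the temporal-loss point concrete: mean-pooling per-frame embeddings is order-invariant, so a clip and the same clip played backward collapse to the identical vector. This is a toy sketch, not any specific model's API:

```python
import numpy as np

def mean_pool(frame_embeddings):
    """Collapse per-frame CLIP-style embeddings into one video vector by
    averaging, then re-normalizing. Order-invariant: reversing the frame
    sequence yields the identical vector, which is exactly the temporal
    information loss described above."""
    v = np.asarray(frame_embeddings, dtype=float).mean(axis=0)
    return v / np.linalg.norm(v)
```

A "car approaches" clip and a "car drives away" clip made of the same frames in reverse would be indistinguishable under this scheme, which is why a natively video-aware embedding model matters here.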

