Hacker News | new | past | comments | ask | show | jobs | submit | kamranjon's comments | login

I'm particularly interested in it being REALLY fast - do you have any rough tok/s numbers for the flash model? I'm excited for unsloth to drop some quants that I can try and run locally, but really curious how it's been performing speed wise. In general I actually over-index on speed over intelligence. I'd rather a model make mistakes quickly and correct in a follow-up than take forever to get a slightly better initial result.

Take a look at the Time column in https://gertlabs.com/?mode=oneshot_coding -- this is the total time to complete a solution for a reasonably complex problem end-to-end (you would have to divide by avg submission size to estimate tok/s). It's fast in the sense that most of the smart, recent Chinese releases are quite slow, especially the DeepSeek Pro variant. Opus 4.7 is also quite fast.
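The leaderboard only reports wall-clock time, so converting to tok/s is just division. A back-of-envelope sketch (the 2000-token submission size and 80-second completion time below are made-up illustrative numbers, not figures from the leaderboard):

```javascript
// Rough tok/s estimate from end-to-end solution time.
// Both inputs are hypothetical: a 2000-token submission
// completed in 80 seconds of wall-clock time.
function estimateTokPerSec(submissionTokens, totalSeconds) {
  return submissionTokens / totalSeconds;
}

const estimate = estimateTokPerSec(2000, 80); // 25 tok/s
```

Note this folds prompt-processing and any reasoning tokens into the same number, so it understates true decode speed.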

If pure speed is most important for your use case, GPT-5.3 Chat is the fastest model we've tested and it's still reasonably smart. Not meant for agentic tool usage / long context, though.

So it might be more useful for business applications or non-engineering usage where you don't need exceptional intelligence, but it's useful to get fast, cheap responses.


This black box approach that large frontier labs have adopted is going to drive people away. Changing fundamental behavior like this without notifying users, and only explaining it after the fact, is exactly why people will move to self-hosting their own models. You can't build pipelines, workflows and products on a base that is randomly shifting beneath you.

In what ways has fetch never caught up to axios? I have not encountered a situation where I could not use fetch in the last 5 years, so I'm just curious what killer features axios has that are still missing in fetch (I certainly remember using axios many moons ago).

Missing a lot of event hooks that axios/ky give you.

Simple examples are interceptors and error handling.

Fetch is one of those things I keep trying to use, but then sorely regret doing so because it's a bit rubbish.

You're probably reinventing axios functionality, badly, in your code.

It's especially useful when you want consistent behaviour across a large codebase: say you want to detect 401s from your API and redirect to a login page, but you don't want to write that on every page.

Now you can do monkey patching shenanigans, or make your own version of fetch like myCompanyFetch and enforce everyone uses it in your linter, or some other rubbish solution.

Or you can just use axios and an interceptor. Clean, elegant.
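For readers unfamiliar with the pattern: a response interceptor is just a hook that runs on every response before the call site sees it. A minimal dependency-free sketch of the idea (the `withInterceptor` and `fakeFetch` names here are invented for illustration; axios's real API is `axios.interceptors.response.use(...)`):

```javascript
// Minimal sketch of the response-interceptor pattern:
// wrap any fetch-like function so a hook runs on every
// response -- e.g. handle 401s in one place instead of
// on every page.
function withInterceptor(fetchLike, onResponse) {
  return async (...args) => {
    const res = await fetchLike(...args);
    return onResponse(res);
  };
}

// Fake fetch that always returns a 401, for illustration.
const fakeFetch = async () => ({ status: 401 });

let redirectedTo = null;
const apiFetch = withInterceptor(fakeFetch, (res) => {
  if (res.status === 401) redirectedTo = "/login"; // a real app would redirect here
  return res;
});
```

With actual axios the same behaviour is a one-time registration, something like `axios.interceptors.response.use(res => res, err => { if (err.response?.status === 401) location.assign("/login"); throw err; })`, since axios rejects on non-2xx statuses by default.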

And every project gets to a size where you need that functionality, or it was a toy and who cares what you use.


Forcing everyone to use ourFetch is rubbish, but forcing everyone to use axios is clean and elegant? You might want to elaborate just a little more.

ourFetch is more likely to be buggy, unmaintained, undocumented and nobody knows it well because the guy who wrote it left the org 2 years ago and so you have to waste time reading and maintaining it yourself.

Axios is something where you get most of that work done for you by the community for free, and a lot of people know it. As long as you don’t get pwned due to it. Oh and you will actually find community packages that integrate with it, vs ourFetch, which again, nobody knows or even cares that it exists.

Applies to web frameworks, databases and other types of software and dependencies - if you work with brilliant people, you might succeed rolling your own, but for most people taking something battle tested, something off the shelf is a pretty sane way to go about it.

In this case it’s a relatively small dependency so it’s not the end of the world, but it’s the exact same principle.


> In this case it’s a relatively small dependency so it’s not the end of the world, but it’s the exact same principle.

An alternative world-view is: "A little copying is better than a little dependency," from https://go-proverbs.github.io

Does become subjective about what "small" and "little" are though.


I also agree with this!

I think the ideal model is being able to depend on upstream code, but also to review ALL of the actual code changes when pulling in new dependency versions (with a nice UI), and to vendor things and branch off with a single command whenever you need to. That way you don't have to maintain the code yourself by default, but it's trivial to take ownership when you want to.

It's actually surprising that the whole shadcn approach hasn't gotten more popular in front-end development, or anywhere else for that matter: copying the code into your project makes it much easier to maintain and to compile/deploy, with less complexity along the way.


Exactly, I completely agree.

It's the difference between using a SQL library and some person on your team writing their own SQL library and everyone having to use it. There's a vast gulf between the two, professionally speaking.

People dissing axios probably suffer from other NIH problems too.


It's interesting that, of the large inference providers, Google has one of the most inconvenient policies around model deprecation. They deprecate models exactly 1 year after releasing them and force you to move onto their next generation of models. I had assumed, because they are using their own silicon, that they would actually be able to offer better stability, but the opposite seems to be true. Their rate limiting is also much stricter than OpenAI's, for example. I wonder how much of this is related to these TPUs, vs just strange policy decisions.

It's frustrating how cavalier they are about killing old Gemini releases. My read is that once a new model is serving >90% of volume, which happens pretty quickly as most tools will just run the latest+greatest model, the standard Google cost/benefit analysis is applied and the old thing is unceremoniously switched off. It's actually surprising that they recently extended the EOL date for Gemini 2.5. Google has never been a particularly customer-obsessed company...

What benefit is there to sticking on older models? If the API is the same, what are the switching costs?

Consistency: new models don't behave the same on every task as their predecessors. You end up building pipelines that rely on specific behavior, and then find that the new model performs worse on a particular task, or just behaves differently and needs prompt adjustments. Providers can also fundamentally change default model settings between releases: for example, the Gemini 2.5 models behaved completely differently with regard to temperature settings than previous models. It creates a moving target that you constantly have to adjust and rework, instead of a platform that you, and by extension your users, can rely on. Other providers have much longer deprecation windows, so they at least seem to understand this frustration.

> Consistency, new models don't behave the same on every task as their predecessors. So you end up building pipelines that rely on specific behavior

If this is a deal breaker, then self-hosting is the only solution. Due to the hardware premium, all models hosted by 3rd-parties will be deprecated to make room for newer, better, and more efficient models.


Sure, but Google also leaves little to no overlap between models and often will leave models in preview mode (which many companies cannot use in production for legal reasons) - right up until the point that the previous model is deprecated.

The point is that if you want to build a platform that customers can rely on, based on their own schedules of feature development, you need to support models for longer periods of time. For example, OpenAI still offers older models like GPT-4, which was released in 2023; this gives customers plenty of time to test, experiment and eventually migrate to a newer model if it makes sense.


If you're trying to run repeatable workflows, stability from not changing the model can outweigh the benefits of a smarter new model.

The cost can also change dramatically: on top of the higher token costs for Gemini Pro ($1.25/mtok input for 2.5 versus $2/mtok input for 3.1), the newer release also tokenizes images and PDF pages less efficiently by default (>2x token usage per image/page), so you end up paying much, much more per request on the newer model.

These are somewhat niche concerns that don't apply to most chat or agentic coding use cases, but they're very real and account for some portion of the traffic that still flows to older Gemini releases.
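To make the pricing point concrete, here's a back-of-envelope sketch. The $/mtok prices come from the comment above; the request shape (5000 text tokens, 50 PDF pages) and the per-page token counts are invented purely for illustration:

```javascript
// Illustrative input-cost comparison for a PDF-heavy request.
// Prices ($ per million input tokens) are from the comment;
// the per-page token counts are hypothetical, chosen to
// reflect the ">2x token usage per page" claim.
function inputCostUSD(textTokens, pages, tokensPerPage, pricePerMtok) {
  return ((textTokens + pages * tokensPerPage) / 1e6) * pricePerMtok;
}

const costOld = inputCostUSD(5000, 50, 258, 1.25); // older model
const costNew = inputCostUSD(5000, 50, 560, 2.0);  // newer model, >2x tokens/page
```

Under these made-up assumptions the newer model works out to roughly 3x the input cost per request, since the price increase and the tokenization change compound.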


I've heard GenAI.mil still has Gemini 2.5 only.

Wouldn't surprise me. The best model you can get on AWS GovCloud is still Claude Sonnet 4.5.

Flash 2 isn't even at EOL until June, but we started seeing ~90% error rates (429s) over the weekend. (So we switched to GPT 5.4 nano.)

I use LMStudio to host and run GLM 4.7 Flash as a coding agent. I use it with the Pi coding agent, but also use it with the Zed editor agent integrations. I've used the Qwen models in the past, but have consistently come back to GLM 4.7 because of its capabilities. I often use Qwen or Gemma models for their vision capabilities. For example, I often will finish ML training runs, take a photo of the graphs and visualizations of the run metrics and ask the model to tell me things I might look at tweaking to improve subsequent training runs. Qwen 3.5 0.8b is pretty awesome for really small and quick vision tasks like "Give me a JSON representation of the cards on this page".
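LM Studio serves a local OpenAI-compatible chat-completions endpoint, so a vision request like the card-extraction one above is just a chat payload with a base64-encoded image attached. A sketch of building that payload (the model id and the base64 string are placeholders, and nothing here actually sends the request; you'd POST it to your local server, typically `http://localhost:1234/v1/chat/completions`):

```javascript
// Build an OpenAI-style chat-completions body for a locally
// served vision model. `imageBase64` would be your photo of
// the training-run graphs or the page of cards.
function buildVisionRequest(model, prompt, imageBase64) {
  return {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${imageBase64}` },
          },
        ],
      },
    ],
  };
}

const body = buildVisionRequest(
  "qwen3.5-0.8b", // hypothetical local model id
  "Give me a JSON representation of the cards on this page",
  "iVBORw0KGgo..." // truncated placeholder, not real image data
);
```

The same payload shape works against any OpenAI-compatible server, which is what makes swapping between Qwen, Gemma, and hosted models for vision tasks low-friction.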

My m4 max mbp with 128gb of memory is constantly training 24/7 on weekends- it’s why I bought the thing.

I'm crossing my fingers they release a flash version of this. GLM 4.7 Flash is the main model I use locally for agentic coding work, it's pretty incredible. Didn't find anything in the release about it - but hoping it's on the horizon.


I've had really good success with LMStudio and GLM 4.7 Flash and the Zed editor, which has a baked-in integration with LMStudio. I am able to one-shot whole projects this way, and it seems to be constantly improving. Some update recently even allowed the agent to ask me if it can do a "research" phase - so it'll actually reach out to websites and read docs and code from GitHub if you allow it. GLM 4.7 Flash has been the most adept at tool calling I've found, but the Qwen 3 and 3.5 models are also fairly good, though they run into more snags than I've seen with GLM 4.7 Flash.


Does anyone know of an open weights models that can embed video? Would love to experiment locally with this.


Not aware of any that do native video-to-vector embedding the way Gemini Embedding 2 does. There are CLIP-based models (like VideoCLIP) that embed frames individually, but they don't model the temporal dimension. You'd need to average frame embeddings, which loses a lot.

Would love to see open-weight models with this capability since it would eliminate the API cost and the privacy concern of uploading footage.
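The frame-averaging fallback mentioned above is just mean pooling over per-frame vectors, then re-normalizing. A minimal sketch, assuming you've already produced per-frame embeddings with something like a CLIP model:

```javascript
// Mean-pool per-frame embedding vectors into a single
// video-level vector, then L2-normalize. Temporal order is
// discarded entirely, which is the information loss noted
// above: a video played backwards pools to the same vector.
function meanPoolEmbeddings(frames) {
  const dim = frames[0].length;
  const pooled = new Array(dim).fill(0);
  for (const frame of frames) {
    for (let i = 0; i < dim; i++) pooled[i] += frame[i] / frames.length;
  }
  const norm = Math.hypot(...pooled);
  return pooled.map((x) => x / norm);
}
```

For long clips, a common tweak is pooling over sampled keyframes rather than every frame, but the temporal-blindness problem remains either way.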


A quick search brought up https://qwen.ai/blog?id=qwen3-vl-embedding but I have no idea if it does what Gemini is doing here.


It works more or less similarly; I made a proof of concept for it: https://github.com/jakejimenez/sentinelsearch


Very cool, thanks. Will check it out.


Can you go into more detail about what you did with your investments? How did you invest in the international market?


Not OP, however I also did that. I used Vanguard's international index fund.


Pretty much this

