Amazing work and people should really appreciate that the opportunity costs of your work are immense (given the hype).
On another note: I'm a bit paranoid about quantization. People aren't good at discerning model quality at these levels of "intelligence" anymore, and I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?
I was recently trying Qwen 3 Coder Next, and while there are benchmark numbers in your article, they seem to be for the official checkpoint, not the quantized ones. But that's never made really clear (and chatbots mistake them for benchmarks of the quantized versions, btw).
I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article but always updated with all kinds of recent models.
Thanks! Yes, we actually did think about that - sadly it can get quite expensive. Perplexity benchmarks over short context lengths with small datasets are doable, but perplexity isn't an accurate measure. We're currently investigating the most efficient course of action for evaluating quants - will keep you posted!
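For context, the "cheap but inaccurate" perplexity check mentioned above is just the exponential of the mean per-token negative log-likelihood on a held-out text. A minimal sketch (the per-token loss values below are hypothetical, not from any real model):

```python
import math

def perplexity(nlls):
    """Perplexity = exp of the mean per-token negative log-likelihood (in nats).

    `nlls` would come from a causal LM's cross-entropy loss over held-out text.
    """
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token losses from a BF16 checkpoint vs a 4-bit quant:
bf16_nlls = [2.10, 1.95, 2.30, 2.05]
q4_nlls = [2.15, 2.01, 2.38, 2.09]

print(perplexity(bf16_nlls))  # lower is better
print(perplexity(q4_nlls))    # a small gap suggests mild degradation
```

The catch, as noted above, is that a small perplexity gap on a short generic text doesn't guarantee the quant holds up on long-context or task-specific benchmarks like Aider.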
Yes sadly very expensive :( Maybe a select few quants could happen - we're still figuring out what is the most economical and most efficient way to benchmark!
Oh it's more time that's the issue - each benchmark takes 1-3 hours ish to run on 8 GPUs, so running on all quants per model release can be quite painful.
Assume AWS spot pricing of say $20/hr for the 8 B200 GPUs; at 1-3 hours per benchmark that's $20-$60 ish per quant. Assuming we benchmark BF16, 8-bit, 6, 5, 4, 3 and 2 bits, that's 7 ish tests, so $140 to $420 ish per model. Time wise, 7 hours to 1 day ish.
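Spelling out that back-of-the-envelope estimate (all figures are the ballpark numbers from the comment, not real quotes):

```python
# Rough cost/time estimate for benchmarking one model release across quants.
GPU_NODE_RATE = 20.0   # assumed spot price for an 8x B200 node, $/hr
HOURS_PER_RUN = (1, 3)  # each benchmark takes roughly 1-3 hours
QUANTS = ["BF16", "8bit", "6bit", "5bit", "4bit", "3bit", "2bit"]  # 7 tests

low_cost = GPU_NODE_RATE * HOURS_PER_RUN[0] * len(QUANTS)
high_cost = GPU_NODE_RATE * HOURS_PER_RUN[1] * len(QUANTS)
low_hours = HOURS_PER_RUN[0] * len(QUANTS)   # if runs are sequential
high_hours = HOURS_PER_RUN[1] * len(QUANTS)

print(f"${low_cost:.0f}-${high_cost:.0f} per model, "
      f"{low_hours}-{high_hours} hours sequential")
```

That is, $140-$420 and 7-21 node-hours per model release, which is why running every quant on every release adds up fast.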
We could run them after a model release which might work as well.
I find it hard to trust post-training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out, because automatically running a suite of benchmarks should be the easiest thing to do.
Small feedback if any of the Antigravity people read here: "Fast" is not a great name for the "eager" option (vs. "Planning") because "Fast" is associated with "dumb" in LLMs (fast/flash/mini). Probably "Eager" would be a more descriptive name
It’s just a detail, the international financial market/banking system is basically under active US control, just look at what happened to Wegelin & Co. (at that point the oldest bank in Switzerland) when they thought that that was not the case.
Mechanically sure, but I still feel way safer when a Tesla (of any kind) is approaching me as a pedestrian or bicyclist than any other vehicle (except maybe Waymo) because I know they will alert the driver and brake if necessary. Any other car, especially older trucks, I'm quite afraid of, based on experience.
> because I know they will alert the driver and brake if necessary.
This is not necessarily accurate.
https://x.com/TaylorOgan/status/1681240264554209281 ("Warning: Graphic; Last month, a 76-year-old pedestrian was tragically mowed down by a Tesla Model S in Brooklyn, NY. Both of his legs were torn off, according to witnesses. New data from the NHTSA says the Tesla was engaged on Autopilot/Full Self-Driving mode.")
I own several Teslas, would not trust them to stop for a pedestrian while in any driver assist mode. It may work, but if you rely on it, be prepared for consequences when it fails, as you are the responsible party when it fails.
Tesla is currently renting vehicles for $60/day due to diminished demand; if one would like to test this personally, the cost is minimal. Avoid bodily injury whenever possible during testing.
Edit: @romaaeterna Are you willing to stand in front of it while it is at speed without a safety driver? I am trying to reconcile the mental model with risk appetite and potential gaps between priors and current state.
I have a Tesla and drive FSD back and forth to work every day. It's great
Edit in response to your edit:
Would I risk myself standing in front of a FSD Tesla versus in front of an Uber or an average human-controlled car with the standard percentage chance of the human texting or being otherwise distracted or drunk or tired? I would take FSD. And I think that a mathematical rather than emotional evaluation of the odds would make risk-minded people do the same.
You would need to compare the data against the data of non-smart trucks. I'm guessing it's an order of magnitude more dangerous to be a pedestrian around a normal truck.
Automatic emergency braking is a standard feature on many new cars, and will be mandatory for all new passenger cars and light trucks in the U.S. by September 2029. I am open to the assertion that Tesla's AEB, when scoped to pedestrian scenarios, is superior to other AEB systems, but this assertion requires independently verified data and evidence for support.
In my experience, Tesla drivers are some of the worst drivers on the road. They seem to pay the least attention to what's going on around them and are the most likely to play fast and loose with the rules of the road. I don't know what accounts for this. There has been at least one study out of Berkeley that suggests that people who drive more expensive cars are more likely to break the rules of the road. It's possible that (at least here in Seattle) this is more likely to be the driver's first car, since many people driving them are highly paid tech workers who often hail from other countries and may not have as good of a grasp of driving in the US. Or it may be that this is enabled by Autopilot itself (if your car is taking care of the safety, you don't have to pay as much attention).
The last reason is the biggest imo. Previously if you didn't pay attention you would crash relatively often. Now you aren't punished in the same way. In the same way spell check made us worse spellers. You aren't required to pay attention to detail, so you never develop that skill.
I taught my kids to drive both manuals and automatics. Usually they got the hang of driving an automatic first, and then we added the manual into the mix.
But with one of my kids, it was exactly as above. They scared the crap out of me, because they just would not focus well enough. We transitioned to a manual so that they were required to focus on the task at hand, and they then turned into a good driver.
(Aside: my kids, now college+ age have all gotten great deals on cars on college budgets, because they were willing to take a manual that cost far less due to reduced demand).
> There has been at least one study out of Berkeley that suggests that people who drive more expensive cars are more likely to break the rules of the road.
In Germany, we have a joke - BMWs don't need turn signal indicators, they have built-in precedence that comes with paying the money one needs to have to afford a BMW.
Could you give me some numbers about deaths caused by Tesla versus other brands per mile driven? It seems to be very difficult to find enough information to draw any conclusions.
Why did you stop training shy of the frontier models? From the log plot it seems like you would only need ~50% more compute to reach frontier capability
Makes sense! I like that you guys are more open about it. The other labs just drop stuff from the ivory tower. I think your style matches better with engineers who are used to datasheets etc. and usually don't like poking a black box