Here's a crucial mechanism that Paul Graham did not mention:
With a wealth tax, using his calculation, the higher your returns, the lower the comparable income tax rate. Take $100 and a 1% wealth tax: if your returns are 10%, you pay $1 on $10 of capital gains, which is 10%, and you end up with $109. Conversely, someone achieving a mere 1% return would effectively be taxed at 100% of it.
With income taxes it's usually the opposite: the more you earn, the higher the tax bracket you will be put into.
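The wealth tax arithmetic above can be sketched in a few lines. This is only an illustration under stated assumptions: a 1% wealth tax levied on $100 of starting capital (which is what the "$1 on $10" figures imply), and a made-up function name `equivalent_income_tax_rate`.

```python
def equivalent_income_tax_rate(annual_return, wealth_tax_rate=0.01, capital=100.0):
    """Wealth tax paid, expressed as a fraction of the year's gains.

    Assumption: the tax is levied on the starting capital,
    not on the end-of-year balance.
    """
    if annual_return <= 0:
        raise ValueError("the comparison only makes sense for positive returns")
    gains = capital * annual_return
    tax = capital * wealth_tax_rate
    return tax / gains


# The same fixed wealth tax eats a smaller share of bigger gains.
for r in (0.01, 0.05, 0.10, 0.20):
    print(f"{r:.0%} return -> {equivalent_income_tax_rate(r):.0%} of gains taxed away")
```

The rate falls as returns rise, which is the opposite of a progressive income tax schedule.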
Somebody like Paul Graham surely has higher than 10% capital gains, otherwise he wouldn't exactly be a great investor.
Personally I'm against wealth taxes; I think capital gains taxes are a much more appropriate and fairer tool. I also think taxes in general are way too high: if you are part of the middle class and add up everything you pay in taxes, fees, insurance, duties and whatnot, you can end up losing 70-90% of whatever you earn. For the vast majority of people it's extremely hard to actually accumulate wealth.
3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite: $1551 for 3.5 Flash [0] vs $892 for 3.1 Pro [1]. That's 74% more cost while ranking lower. It's 2.5x as fast but I don't think the bang for the buck is there anymore like it was with 3.0 Flash. I'm a bit bummed out to be honest.
I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade, as the value proposition is very different.
One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.
Ouch. That's going in completely the wrong direction.
How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?
3.1 has 57M output tokens on the Intelligence Index and 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper. I don't get how 3.5 can be 74% more expensive.
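For reference, the two ratios being compared in this thread can be checked quickly. The numbers are the ones quoted in the comments; the helper name `percent_more` is made up, and the snippet only quantifies the gap, it does not explain it.

```python
def percent_more(new, old):
    """How much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


cost_increase = percent_more(1551, 892)  # suite cost in USD: 3.5 Flash vs 3.1 Pro
token_increase = percent_more(73, 57)    # output tokens in millions

print(f"cost: +{cost_increase:.0f}%, output tokens: +{token_increase:.0f}%")
```

So the cost gap is noticeably larger than the output token gap alone would suggest.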
The Elo rating system measures performance relative to the other models. As the other models improve, or rather as newer and better models enter the list, the Elo score of a given existing model will tend to decrease even though there might be no changes whatsoever to the model or its system prompt.
You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
Yes, that is definitely a limitation. If all models become worse at the same pace, we won't see any degradation either. I couldn't find any historical dataset of model benchmarks (I'd really have loved that, to see how performance holds over time vs. the initial announcement), so the Elo data from Arena AI was the least imperfect proxy I could find.
The relative and auto-scaling nature of Elo ranking feels like an advantage here.
Relative ranking systems extract more information per tournament. You will get something approximating the actual latent skill level with enough of them.
Advantage for what exactly though? I'm not saying Elo ranking doesn't give any information. It just doesn't give the information that the OP's project claims to be able to give: that models get nerfed over time. You could extract this kind of information from the raw results of each evaluation round between two models, ignoring any new model entries and comparing these over time, but not from the resulting Elo scores with an ever-changing list of models.
New models are on average better than older models, so the average skill of the population of models increases over time, and you are mathematically guaranteed that any existing model will degrade in Elo score over time even though it didn't change in any way.
It's like benchmarking a model against a list of challenges that over time are made more and more difficult and then claiming the model got nerfed because its score declined.
Elo is good at establishing an overall ranking order across models but that's not what this is about.
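The drift described above can be reproduced with a toy simulation. It is a sketch under loud assumptions, not how any real leaderboard computes its scores: a pure zero-sum Elo update with K=16, no randomness (each pairing's result is the score implied by hypothetical latent skills), and every newcomer entering at 1500. As the next comment notes, some real Elo systems let points enter the pool, which this sketch ignores.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def play_round(ratings, skills, k=16.0):
    """One round-robin; results follow latent skill, ratings get updated."""
    names = list(ratings)
    updated = dict(ratings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            actual = expected_score(skills[a], skills[b])       # what really happens
            predicted = expected_score(ratings[a], ratings[b])  # what ratings expect
            delta = k * (actual - predicted)
            updated[a] += delta  # zero-sum: whatever A gains, B loses
            updated[b] -= delta
    return updated


def simulate(rounds=300):
    # Hypothetical latent skills; the skill of "old" never changes.
    skills = {"old": 1500.0, "peer": 1500.0}
    ratings = {"old": 1500.0, "peer": 1500.0}
    history = [ratings["old"]]
    for name, skill in (("new_a", 1600.0), ("new_b", 1700.0)):
        skills[name] = skill
        ratings[name] = 1500.0  # assumed entry rating for newcomers
        for _ in range(rounds):
            ratings = play_round(ratings, skills)
        history.append(ratings["old"])
    return history, ratings


history, final_ratings = simulate()
print([round(r) for r in history])
```

In this setup the rating of "old" settles at roughly 1467 after the first newcomer and roughly 1425 after the second, although its skill stayed at 1500 throughout. Only a fixed test set would show that nothing about the model changed.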
Elo systems often include one or more ways new points can enter the system. The system used by the European Go Federation has three ways iirc: 1. Cannot go under 100, 2. Cannot lose more than 100 points in one tournament, 3. Weaker player beating a stronger one (which is countered by the stronger player beating the weaker one, but it's not balanced: if two people only play each other forever and ever, both of their Elos will grow).
Interestingly NET is down 15%-ish in extended hours trading and was even down 20% at some point. Many times a stock will make a positive move when layoffs are announced.
Cloudflare is a growing company by most metrics so if efficiencies through AI were the reason for the layoffs they'd just take the boost and grow even faster.
It all doesn't check out and I think the real reason for the layoffs and the negative sentiment by the market on the news is that their revenue growth was not as fast as their expenses and they realized they overhired. Leadership doesn't want to dive too much into the red even if it would mean bigger growth down the line. They are now beholden to the near and mid term stock performance.
I've had the chance to talk to some SWEs working at Cloudflare off the record in recent months, and the one consensus I heard was that there was often tension between the boots on the ground and the decisions from senior management. Of course there was nothing they could do about it, and especially after this they'll make sure to stay quiet should they remain. There seemed to be a lot of pressure to deliver features and new products while quality was left behind, which means the SWEs felt pressure to deliver while also having to resolve the ensuing issues.
Either way I wish everyone affected the best and a speedy job hunt - there'll be quite a few really good people on the market now through no fault of their own.
Apple Silicon uses unified memory where the CPU and GPU use the exact same memory and no copies from RAM to VRAM are needed. The article opens by mentioning just that and indeed it is the whole point of the article.
I am always a bit baffled why Apple gets credited with this. Unified memory has been a thing for decades. I can still load the biggest models on my 10th gen Intel Core CPU and the integrated GPU can run inference.
The difference is that modern integrated GPUs are just that much faster and can run inference at tolerable speeds.
(Plus NPUs being a thing now, but that also started much earlier. The 10th gen Intel Core architecture already had instructions to deal with "AI" workloads... just very preliminary ones)
That’s shared, not unified: it’s partitioned, with CPU and GPU copies managed by the driver. Lunar Lake (2024) is getting closer but is still not as tightly integrated as Apple's and is capped at 32GB (Apple goes up to 512GB). AMD Ryzen AI Max is closer to Apple but its memory is still 3 times slower.
Shared vs unified is merely a driver implementation detail. Regardless, in practice (IIUC) data is still going to be copied if you perform a transfer using a graphics API because the driver has no way of knowing what the host might do with the pointed-to memory after the transfer.
If you make use of host pointers and run on an iGPU no copy will take place.
My last serious GPU programming was with OpenCL. And if my memory does not fail me the API was quite specific about copying and/or sharing memory on a shared memory system.
I am pretty sure that my old 10th gen CPU/GPU combo has the ability to use the "unified"/zero-copy access mode for the GPU.
I don't think people are crediting Apple with inventing unified memory - I certainly did not. There have been similar systems for decades. What Apple did is popularize this with widely available hardware: GPUs that don't totally suck for inference, combined with RAM that has decent speed at an affordable price. Before that you either had iGPUs, which were slow (plus not exactly the fastest DDR memory) but at least sat on the same die, or you had fast dGPUs with their own limited amount of VRAM. So the choice was between direct memory access without much power, or a powerful GPU strangled by having to go through the PCIe subsystem to access RAM.
The article is talking about one particular optimization that one can implement with Apple Silicon and I at least wasn't aware that it is now possible to do so from WebAssembly - so to completely dismiss it as if it had nothing to do with Apple Silicon is imho not fair.
Yes but that is just a tiny part of the whole CF worker ecosystem. The other services are not open source and so the lock-in is very very real. There are no API compatible alternatives that cover a good chunk of the services. If you build your application around workers and make use of the integrated services and APIs there is no way for you to switch to another provider because well, there is none.
And now you've put everything on the equivalent of a single NodeJS process running on a tiny VM. Next step: spread out over multiple durable objects but that means implementing a sharding logic. Complexity escalates very fast once you leave toy project territory.
D1 reliability has been bad in our experience. We've had queries hanging on their internal network layer for several seconds, sometimes double digits, over extended periods (on the order of weeks). Recently I've seen plain network exceptions a few times - again, these are internal between their worker and the D1 hosts. And many of the hung queries wouldn't even show up under traces in their observability dashboard so unless you have your own timeout detection you wouldn't even know things are not working. It was hard to get someone on their side to take a look and actually acknowledge and understand the problem.
But even without the network issues that have plagued it I would hesitate to build anything for production on it, because it can't even do transactions and the product manager for D1 openly stated they won't implement them [0]. Your only way to ensure data consistency is to use a Durable Object, which comes with its own costs and tradeoffs.
> And many of the hung queries wouldn't even show up under traces in their observability dashboard
How did you work around this problem? As in, how do you monitor for hung queries and cancel them?
> D1 reliability has been bad in our experience.
What about reads? We use D1 in prod & our traffic pattern may not be similar to yours (our workload is async queue-driven & so retries last in order of weeks), nor have we really observed D1 erroring out for extended periods or frequently.
> How did you work around this problem? As in, how do you monitor for hung queries and cancel them?
You just wrap your DB queries in your own timeout logic. You can then continue your business logic but you can't truly cancel the query because well, the communication layer for it is stuck and you can't kill it via a new connection. Your only choice is to abandon that query. Sometimes we could retry and it would immediately succeed, suggesting that the original query probably hit something like packet loss that wasn't handled properly by CF. That's easy when it's a read, but with writes it gets complicated fast and you have to ensure your writes are idempotent. And since they don't support transactions it's even more complex.
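The "wrap it in your own timeout and abandon the query" pattern described above looks roughly like this. It is a language-neutral sketch written in Python with asyncio, not Worker code and not the D1 API: `with_timeout` and `make_query` are hypothetical names, and in a Worker the same idea would be expressed with the platform's own promise and timer primitives.

```python
import asyncio


async def with_timeout(make_query, timeout_s, retries=1):
    """Run make_query() under a deadline; abandon a hung attempt and retry.

    make_query is a hypothetical zero-argument coroutine function that
    issues the query. A timed-out attempt is abandoned, not cancelled,
    so it may still complete later: retried writes must be idempotent.
    """
    last_error = None
    for _ in range(retries + 1):
        task = asyncio.create_task(make_query())
        try:
            # shield() keeps the underlying task alive when the wait times
            # out, mirroring a query that cannot be killed, only ignored.
            return await asyncio.wait_for(asyncio.shield(task), timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # give up on this attempt and try again
    raise last_error
```

Because the abandoned attempt is never cancelled, a write that "timed out" may still have been applied, which is why idempotent writes matter so much without transactions.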
Aphyr would have a field day with D1 I'd imagine.
> What about reads? We use D1 in prod & our traffic pattern may not be similar to yours (our workload is async queue-driven & so retries last in order of weeks), nor have we really observed D1 erroring out for extended periods or frequently.
We have reads and writes which most of the time are latency sensitive (direct user feedback). A user interaction can usually involve 3-5 queries and they might need to run in sequence. When queries take 500ms+ the system starts to feel sluggish. When they take 2-3s it's very frustrating. The high latencies happened for both reads and writes: you can do a simple "SELECT 123" and it would hang. You could even reproduce that from the Cloudflare dashboard when it's in this degraded state.
From the comments of others who had similar issues I think it heavily depends on the CF locations or D1 hosts. Most people are probably lucky and don't get one of the faulty D1 servers. But there are a few dozen people who were not so lucky. You can find them complaining on GitHub, on the CF forum etc., but they are simply not heard. And you can find these complaints going back years.
This long timeframe without fixes to their network stack (networking is CF's bread and butter!), the refusal to implement transactions, the silence in their forum to cries for help, the absurdly low 10GB limit for databases... it just all adds up. We made the decision to not implement any new product on D1 and just continue using proper databases. It's a shame because workers + a close-by read replica could be absolutely great for latency. Paradoxically it was the opposite outcome.
Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 plus was just marginally better than Kimi 2.5. But looking at the stats I'll definitely give GLM 5.1 a try now. [edit: even though looking at it, it's not cheap and has a much smaller context size. And I can't tell about tool use.]