This is fantastic! Speedometer 1.0 was a breath of fresh air, and 2.0 was a much-needed refresh, but it's really been showing its age in recent years. 3.0 looks like a solid upgrade with many new kinds of sub-tests, contemporary frameworks, etc.
I'm looking forward to sucking at this, and then slowly and systematically improving. :^)
Good luck! FYI, there's a hidden developer menu that's handy for browser developers to change the number of iterations, select specific tests, etc: https://browserbench.org/Speedometer3.0/?developerMode, and ?startAutomatically avoids needing to click the button to start the tests.
> This is the first time the Speedometer benchmark, or any major browser benchmark, has been developed through a cross-industry collaboration supported by each major browser engine: Blink/V8, Gecko/SpiderMonkey, and WebKit/JavaScriptCore.
The only other (semi-)alive browser engine today is Servo, originally by Mozilla (and the reason Rust was created), which is these days a Linux Foundation project funded by Igalia.
There are no small web engines anymore. Every other one, from KHTML to Presto to Trident, is dead.
That was it, thanks! I thought I'd controlled for this by using a Private Window, but I'm getting 25.9 with extensions disabled (compared to 17.9 in my original post).
Supported in Firefox for *12 years* now, in Chrome for 10, still no support in Safari.
They only "support" Opus audio in their special-snowflake '.caf' container, which is super buggy; the last time I checked, no open source program could even generate Opus '.caf' files that played in Safari on all Apple platforms. I ended up writing a custom converter that takes a standard '.opus' file (the only format I store on my server) and remuxes it on-the-fly into Safari-compatible '.caf' files, taking special care to avoid all of their demuxer/decoder bugs. You shouldn't have to do this to get cross-browser high-quality audio!
> That's because none of those samples are Opus files, except the last one.
Ooof, I didn't even imagine that the official examples were WAV files. Here's an Opus audio file that plays fine in Safari on macOS and iOS: https://kur-static.biblica.com/audio/GEN_001.webm (Note: I have no idea what this content is, but could not find any English Opus content in the wild.)
> Here's an Opus audio file that plays fine in Safari on macOS and iOS
Yeah, that has Opus packed into a Matroska container (which people usually use only for videos and not pure audio). I suppose that's another good way of getting around the problem!
Just go to the home page https://wpt.fyi/ and see the chart: "Browser-specific failures are the number of WPT tests which fail in exactly one browser." Safari leads by a long shot, with over 3800 tests failing only in Safari. Firefox has 1700 and Chrome fewer still, which roughly matches my own development experience.
Interop is only a tiny subset of the entire suite of WPT tests, and it only contains tests that all vendors agreed upon, so no browser will look bad in Interop.
If you look at the full WPT test suite [1], you'll see that Safari is by far the one failing the biggest number of tests, i.e. the most buggy browser.
The Safari team likes to use Interop to trick people into thinking Safari is as good as the others. It's just a PR play.
> If you look at the full WPT test suite [1], you'll see that Safari is by far the one failing the biggest number of tests, i.e. the most buggy browser.
In Safari's case, most WPT test failures mean "hasn't been implemented yet".
> Interop is only a tiny subset of the entire suite of WPT tests, and it only contains tests that all vendors agreed upon…
Exactly. If you're happy building "Works with Chrome" web apps, Safari is not for you.
"Browser-specific failures are the number of WPT tests which fail in exactly one browser." From wpt.fyi
In other terms, WPT test failures for Safari means Safari has bugs or unsupported features that both Firefox and Chrome do not have.
As for Interop, it focuses on specific, very limited areas, like "scrolling" or "subgrid", and is in no way representative of the overall feature set of a browser.
So no, contrary to what you're implying, it's not that Chrome is too advanced, or doing too much, it's really Safari that is buggy and lagging behind both Chrome and Firefox (by a lot).
Safari has 0 extensions installed, and Firefox 0 extensions installed besides uBlock Origin. Benchmarks were run with each browser as the sole application open and plugged in to a power supply.
This collaboration is pretty exciting. I would expect that the teams of all three rendering engines (WebKit, Blink, Gecko) have done whatever they could to improve performance for the launch and that there won't be any outliers at the beginning with all of them having similar performance.
But the title of future performance king is up for grabs! And now we have a de-facto standard for browser performance benchmarking.
Higher is better. The analogy is speed. You want more speed.
It's not a physical speed, just a benchmark number. Think of it as arbitrary units, which allows you to compare different versions of browsers on the same machine.
On the other hand, premature optimization is the root of all evil.
> Think of it as arbitrary units, which allows you to compare different version of browsers on the same machine.
That's precisely the problem. It's arbitrary, meaningless. Without any physical units, I don't know what's good or bad, fast or slow. And why do the scores go from 0 to 140 when the web browsers are all getting approximately 20?
> On the other hand, premature optimization is the root of all evil.
The web ecosystem is extremely mature and widely used, and the workloads are fairly well understood. The unit is arbitrary, but the factors that go into it reflect a lot of thought about real-world scenarios. Bringing up "premature optimization" is completely irrelevant because that's not what this is, it's about as far as you can get from that.
> that's not what this is, it's about as far as you can get from that.
I don't know what it is. How exactly does the score relate to the experience of the web browser user?
I'm a browser extension developer, and I've occasionally had people ask me about Speedometer scores, but I have no idea what they're supposed to mean or what to tell these people.
Speedometer measures web app responsiveness. Roughly, it simulates a series of user operations on web apps built with various frameworks (as well as vanilla JS), and measures the time it takes to complete them and paint the results to the screen.
The score is a rescaled version of inverse time - if it goes up, that implies the browser can handle more user operations per second, or alternately, it takes fewer milliseconds to complete a user operation in a complex web app.
We know that, but you haven't said anything specific about scores other than higher scores are faster, in an abstract sense, which has already been established.
"The score is a rescaled version of inverse time" is the key here.
If you run all the tests in half the time, your Speedometer score will double. If your score improves by 1%, it implies that you are 1% faster on the subtests.
(There are probably some subtleties here because we're using the geometric mean to avoid putting too much weight on any individual subtest, but the rough intuition should still hold.)
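For intuition, here's a toy sketch of that kind of scoring — a scaled geometric mean of inverse subtest times. The scale constant and exact aggregation here are made up for illustration, not the real Speedometer formula:

```python
from math import prod

def score(times_ms, scale=1000):
    """Toy Speedometer-style score: a scaled inverse of the
    geometric mean of subtest times. Halving every time
    exactly doubles the score."""
    n = len(times_ms)
    geo_mean = prod(times_ms) ** (1 / n)  # geometric mean of the times
    return scale / geo_mean

times = [200, 400, 800]          # subtest durations in ms
halved = [t / 2 for t in times]  # run everything twice as fast

print(score(times))   # 2.5
print(score(halved))  # 5.0 — exactly double
```

The geometric mean is what keeps one slow subtest from dominating: multiplying any single subtest's time by some factor moves the overall score by the n-th root of that factor, not the factor itself.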
Benchmarking is hard. It is very easy to write a benchmark where improving your score does not improve real-world performance, and over time even a good benchmark will become less useful as the important improvements are all made. This V8 blog post about Octane is a good description of some of the issues: https://v8.dev/blog/retiring-octane
Speedometer 3, in my experience, is the least bad browser benchmark. It hits code that we know from independent evidence is important for real-world performance. We've been targeting our performance work at Speedometer 3 for the last year, and we've seen good results. My favourite example: a few years ago, we decided that initial pageload performance was our performance priority for the year, and we spent some time trying to optimize for that. Speedometer 3 is not primarily a pageload benchmark. Nevertheless, our pageload telemetry improved more from targeting Speedometer 3 than it did when we were deliberately targeting pageload. (See the pretty graphs here: https://hacks.mozilla.org/2023/10/down-and-to-the-right-fire...) This is the advantage of having a good benchmark; it speeds up the iterative cycle of identifying a potential issue, writing a patch, and evaluating the results.
This doesn't say anything about what the scores mean.
21 is apparently better than 20, but how much better? You could say "1 better", tautologically, but how does that relate to the real world?
Driving a car 1 mile per hour faster may be better, in a sense, but even if you drove 24 hours straight, it would only gain you 24 total miles, which is almost negligible on such a long trip. Nobody would be impressed by that difference.
Percentages are rarely informative without an absolute reference.
A 5% raise for someone who makes $20k per year is $1k, whereas a 5% raise for someone who makes $200k is $10k, which would be a 50% raise for the former.
You've demonstrated you understand how to use the score to compare both inter-browser performance (analogous to the amount each makes per year) as well as individual browser performance improvements (analogous to the amount of the raise). Seems pretty informative to me?
> "The score is a rescaled version of inverse time" is the key here.
> If you run all the tests in half the time, your Speedometer score will double. If your score improves by 1%, it implies that you are 1% faster on the subtests.
> (There are probably some subtleties here because we're using the geometric mean to avoid putting too much weight on any individual subtest, but the rough intuition should still hold.)
That's irrelevant. The speedometer reading is an absolute reference. The percentages being discussed are simply comparisons, and they're only being discussed to say "they behave like you'd expect."
To directly answer your original question: a reading of 21 is 5% better than a reading of 20 because 21 is 5% greater than 20, and this means that a 21 speed browser should do things 5% faster than a 20 speed browser.
To... itself? Go measure something. You now have a reference!
If you scratch out the labels of your car speedometer and forget which is which, it still measures speed. 80 is still 33% faster than 60, regardless of the units.
I suspect your questions would be answered better by playing around with the tool in question for a few minutes anyway, as you seem to be asking about capabilities the tool does not purport to have.
I guess that’s why it’s fairly interesting to see scores thrown out in this thread on random hardware. It’s anecdata, but gives a sense of the spread/variance of scores for common platforms.
I don’t think this is a number that is ever going to make much sense for consumers, because without this sort of context it’s just going to be like the Spinal Tap ‘this one goes to 11’ sort of problem.
They say something about the speed of the browser, so I don't think it really makes sense to ask extension developers about them. Possibly your extension might make the browser slower, so you could compare scores with and without the extension and see whether it negatively affects performance? (Although I'm not sure it can tell you to what extent it affects performance, only that it does.)
The score goes from 0 to 140 so that there's some room for when computers get faster.
When we started working on this, all browsers were maxed at 140, so the computation got changed.
I thought the front page goes to 140 just because it is modeled after actual GM dashboard speedometers produced ~1960-1990, which sometimes had a range of 0-85 MPH, or 0-140 km/h in metric markets.
The speedometer graphic was inherited from Speedometer 2. When Speedometer 2 was released, scores were in a reasonable car-speed range. The combination of hardware and software improvements meant that early versions of Speedometer 3 (which includes a subset of Speedometer 2 tests) were consistently scoring above 140, so we adjusted the scaling factor (IIRC, by ~20x) to give plenty of room for future improvements.
Nothing actually stops the score from going higher than 140, it will just max out the visual dashboard at that point. On Speedometer 2, Safari on M3 Macs ended up over 500. At scores that high it’s harder to have intuition, thus the changed scale of the new test.
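To illustrate how that rescaling works (the numbers below are made up, not the actual Speedometer constants): dividing every score by one common factor changes the displayed range, but leaves every comparison between browsers intact.

```python
# Hypothetical pre-rescale scores for two browsers
old_scores = {"A": 480.0, "B": 400.0}

# Apply a single scaling factor (~20x is the rough figure mentioned above)
new_scores = {k: v / 20 for k, v in old_scores.items()}

# The ratio between browsers is unchanged: A is still 20% faster than B
assert old_scores["A"] / old_scores["B"] == new_scores["A"] / new_scores["B"]
print(new_scores)  # {'A': 24.0, 'B': 20.0}
```

This is why the absolute numbers only matter relative to each other: the constant can be (and has been) changed between benchmark versions without affecting what the benchmark measures.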
Hopefully at some point actual click latency will be fixed in general after some dark decades.
Still incredible that a Game Boy or an '80s computer with a CRT feels more responsive than most devices these days.
Bring back tactility. I'm convinced the choppiness and weird waits are actually psychologically stressing us out. That's why good keyboards + old low latency OS'es or typewriters are so soothing to use.
I don’t see noticeable lag on pressing/tapping buttons or other ui components in day to day browsing, even on my quite old iPhone.
There are obviously ways to make delays in web content anyway (user action->synchronous network request being the canonical one), but assuming there’s nothing silly like that lag isn’t an issue I’ve noticed.
Actual execution latency is something I worked on for many years in JSC, and there are a lot of engine optimizations to reduce that latency as much as possible (the interpreter itself and its performance, bytecode caches, hilarious amounts of lazy parsing and source skipping, etc.), so even the first time a UI element triggers code there shouldn’t be any significant delay.
Obviously if a developer makes poor choices there’s only so much you can do, but by and large there aren’t that many bad things a web developer can do that a native dev can’t also do (and devs in both environments frequently do :-/).
My iPhone 11 scores 16.8, which really highlights how much of a lead Apple opened up during Qualcomm’s complacent decade.
I think that matters less for benchmark wars between tribes, fun as those always are, than as a stark reminder for any of us building for the public: an S24 is _really_ fast for an Android phone, and your median user probably bought whatever was on sale a couple of years ago. That means that if you’re one of the many developers carrying a reasonably recent iPhone, you have no idea how your app feels to the median user, because your device can run so much more code before it feels sluggish.
Yeah this tracks, Apple clearly have sold their souls to the devil to get the performance they have on iOS. It's basically the sole reason I have an iPhone.
> The primary goal of Speedometer 3 is to reflect the real-world Web as much as possible, so that users benefit when a browser improves its score on the benchmark.
As with any other benchmark its results will be interpreted incorrectly and will have little effect on real world.
Google already has vast amounts of real-world data. The end result? "Oh, you should aim for a Largest Contentful Paint of 2.5 seconds or lower" (emphasis mine): https://blog.chromium.org/2020/05/the-science-behind-web-vit... Why? Because in the real world the vast majority of sites are worse.
Browsers are already optimised beyond any reasonable expectation. Benchmarks like these focus on all the wrong things with little to no benefit to the actual performance of real-life web.
Make all benchmarks you want, but then Google's own Youtube will load 2.5 MB of CSS and 12 MB of Javascript to display a grid of images, and Google's own Lighthouse will scream at you for the hundreds of errors and warnings Youtube embed triggers.
Optimise all you want, and run any benchmarks you want for the "real world", but Lighthouse will warn you when you have over 800 DOM nodes, and will show an error for more than 1400 DOM nodes (which are laughably small numbers) for a reason: https://developer.chrome.com/docs/lighthouse/performance/dom...
1. These companies themselves don't practice what they preach. No matter how fast Speedometer 3 is, Google's own web.dev takes three seconds to load a list of articles, and breaks client-side navigation. Google's own Lighthouse screams at you for embedding Youtube and suggests third-party alternatives [1]
2. The DOM is a horrendously bad, no-good, insanely slow system for anything dynamic. And apps are dynamic.
There's only so much you can optimise in it, or hack around it, until you run into its limitations. The mere fact that a ToDo app with a measly 6000 nodes is called a complex app in these tests is telling.
And the authors of these tests don't even understand the problem. Here's Edge team: "the complexity of the DOM and CSS rules is an important driver of end-user perceived latency. Inefficient patterns, often encouraged by popular frameworks, have exacerbated the problem, creating new performance cliffs within modern web applications".
The popular frameworks go to extreme lengths to not touch the DOM more than it is necessary. The reason the DOM and CSS end up being complicated is precisely because apps are complex, and the DOM is ill-equipped to deal with that.
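To make that concrete: virtual-DOM-style frameworks compute the minimal set of mutations between two renders and apply only those to the real DOM. Here's a deliberately toy keyed-diff sketch of the idea (not any real framework's algorithm):

```python
def diff(old, new):
    """Toy keyed diff in the spirit of virtual-DOM libraries:
    given old and new {key: text} trees, return only the
    mutations needed instead of rebuilding everything."""
    ops = []
    old_keys, new_keys = set(old), set(new)
    for key in old_keys - new_keys:
        ops.append(("remove", key))
    for key in new_keys - old_keys:
        ops.append(("insert", key, new[key]))
    for key in old_keys & new_keys:
        if old[key] != new[key]:
            ops.append(("update", key, new[key]))
    return ops

old = {"title": "Hello", "count": "0"}
new = {"title": "Hello", "count": "1", "footer": "done"}
print(sorted(diff(old, new)))
# Only 'count' and 'footer' would touch the real DOM; 'title' is untouched.
```

Even with this kind of minimization, every surviving mutation still goes through the DOM's style/layout machinery, which is the point being made above.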
This only goes to further show that browser developers have very little understanding of actual web development. And this is on top of the existing problem that web developers have very little understanding of how fast modern machines are and how inefficient web tech is.
This brings us neatly to point number 3:
3. Much of the complexity in modern web apps is due to the fact that the web has next to no building blocks suitable for anything complex.
https://open-ui.org was started 3 (4?) years ago by devs from Microsoft, and you can see from the sheer number of elements and controls just how lacking the web is.
So what do you do when you need a proper stylable control for your app? Oh, you "use inefficient patterns encouraged by modern frameworks", because there's literally no other way.
And even if all of those controls do end up being implemented in browsers, it will still not be enough, because everything else will still be unavailable: from DOM efficiency, to proper animations, to the ability to override control rendering, to...
I see really insane outliers on some tests, some of the time, and this seems to kill the score on my platform (Firefox 122 on Linux x86_64).
As an example "NewsSite-Next" has 8 out of 9 repetitions between 215 and 236 ms, but 1 out of 9 (iteration 5) is 1884 ms. This is such a radical outlier I have trouble believing it could be a browser bug. Visually, the interface seems to get "hung" when switching between tests sometimes. I don't have a great explanation for that.
The specific issue in this case is with NewsSite-Next/NavigateToUS, which reports a 144.5% variance as a result of this one outlier.
I see several others like this in the results, although none quite as extreme.
Probably hardware dependent too as I get 8.11 on my fairly old Samsung S21. My newer laptop gets 14.9 so I'm not sure whether I agree performance is "very bad" on mobile -- it may be less performant but that is surely to be expected given the hardware constraints on mobile. YMMV.
If you’re on iOS, Apple gates Firefox from using JIT JS compilation which massively hinders performance.
E: I was wrong/extremely out-of-date - it does have JIT but relies on the Safari/Webkit implementation. In ancient versions of iOS, the WebView widget that third-party browsers were forced to use had JIT disabled, but that’s long since changed.
For this benchmark I get 12.0 in FF and 13.9 on safari. I’m glad it’s not so big of a gap as I already pay a penalty for not wanting to use safari on iOS (in terms of integration with iOS and usability from Apple’s artificial limitations on third party browsers)
Right, but expecting the same behaviour from "Firefox" on iOS as on desktop is just not going to happen, since they have no control over the core engine. It's why, in general, using iOS devices for cross-browser testing is pretty useless.
This is a fair point, though it is possible for app-level things that the browsers do to regress performance from the baseline pure engine level.
In this case, I think the 3 score must be either very old/low-end Android hardware or a measurement error. I don’t think any iOS browser gets 3.x scores, on even remotely modern hardware.
Why is the scale 0-140? My modern Windows 10 desktop using the latest Firefox gives 15.0/140 with no other programs running besides Firefox and Discord. Surely 15 is horrible in that context?
I have 1 extension, ublock origin, allowed to run on the site by default.
I have never felt performance has ever lacked, outside of a few outlier sites (youtube, facebook, twitch). But those are tightly coupled with their (crappy) implementations.
15 is quite correct, definitely not horrible, so don't worry too much about that. It's not a rating, more a way to compare how browsers and hardware improve over time.
The scale goes up to 140 so that there's some space for software improvements as well as future hardware.
By far the biggest speed complaint I have about Firefox is not in everyday use, but whenever I restore a previously saved session - it basically stops reacting for a few minutes(!) before eventually I can use it again. I suppose it's due to the anti-virus interfering with some kind of memory image or whatever but whatever it is, it's so annoying.
If anyone was going to, it'd probably be me and my ~300 tabs, but I haven't run into this, either. My phone currently has 435 tabs open in Firefox but it's just as responsive as ever.
I would if I didn't have to first create an account. Cross-logging in with an existing account from a different website isn't an option for me either, unfortunately.
Got "Infinity" after testing my Firefox Dev Edition 123b9. Is this because of my FF config because my browser is perhaps blocking something (e.g. canvas, fingerprint, etc) or any result north of 140 is considered infinity?
>because my browser is perhaps blocking something (e.g. canvas, fingerprint, etc) or any result north of 140 is considered infinity?
I vaguely remember there's a privacy protection that rounds timer information, e.g. all timers get rounded to the nearest 100ms. If you have a bunch of tests that take less than 100ms to complete, those tests might appear to finish at the same instant they started, which gives them an infinite score.
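You can see how a coarsened clock produces Infinity — a sketch of the mechanism, not Firefox's actual implementation or granularity:

```python
def coarse_now(t_ms, granularity=100):
    """Simulate a privacy-protected clock rounded to the nearest 100 ms."""
    return round(t_ms / granularity) * granularity

start = coarse_now(1000)  # test starts at t = 1000 ms
end = coarse_now(1040)    # test really finished 40 ms later...
elapsed = end - start     # ...but the rounded clock reports 0 ms

# A score proportional to 1/elapsed then blows up to infinity
score = float("inf") if elapsed == 0 else 1000 / elapsed
print(elapsed, score)  # 0 inf
```

Disabling the timer-clamping privacy setting (or running without the extensions that enable it) should restore a finite score.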
I also got Infinity, on Firefox 123.0.1, with a bunch of privacy extensions.
There is only one warning in the Console:
Ignoring ‘preventDefault()’ call on event of type ‘wheel’ from a listener registered as ‘passive’. react-dom.production.min.js:29:112
This is because browsers on iOS do not 'use Safari' but use WebKit, and there is a huge amount of browser app software built on top of it, which will contribute to variance on benchmarks (and also to these being very different browsers ultimately).
On my machine, Firefox got 12.3, and Edge (Chromium) got 12.8. I don't believe that the performance characteristics of these two are that close unless I'm missing something. For example, audio players on Edge stutter a lot while Firefox plays them smoothly. An example is: https://deepsid.chordian.net/ I believe Edge is slower not because Chromium is slow, but because of Microsoft's overreaching efforts on energy conservation.
Machine: AMD 5950X, 32GB RAM, 3080 GPU, Windows 11 Pro 23H2
Firefox v123.0.1
Edge v122.0.2365.80
EDIT: Interesting, I tried both in private windows later to bypass extensions, and Edge got 10.8 this time, Firefox got 16.9. I now have more questions.
I ran v3 on my machine while listening to "Voyage" by "Yahel & Eyal Barkan" in Chrome and doing a bunch of background stuff. The background stuff took up about 20% of my CPU. While testing, the music played perfectly without any buffer underrun pops.
Ran it in each browser one at a time while the music played in Chrome.
Chrome 122.0.6261.112: 21.3 +/- 0.64
Edge 122.0.2365.80: 20.1 +/- 0.78
Firefox 121.0.1: 18.5 +/- 0.75
Machine specs: Intel Core i9 12900k (24 core) / 64GB RAM / 3080Ti / Windows 11 Pro 23H2
After finishing the tests, I played that same song on Firefox and Edge. Both Firefox and Edge played it perfectly.
> audio players on Edge stutter a lot while Firefox plays them smoothly
I'm curious about what could be leading to this inconsistency as I use Web Audio for a number of projects, so I have a bit of a vested interest. It is notoriously easy to do WebAudio wrong or to do just a bit too much computation which leads to buffer underruns (pops). It also may have a lot to do with specific tracks on DeepSID, could you share some tracks that perform inconsistently for you?
I think Edge's problems come from some kind of power efficiency setting, not necessarily performance-related. (Like a low-granularity JS timer, or something like that)
EDIT: Turning off all efficiency settings on Edge didn't make any difference: 11.0
Audio wouldn't be going via the DOM or JS, right? I know that Firefox has its own codec support and that Safari on Mac uses different AV stuff than other browsers.
I don't think that AV stuff would be tested by Speedometer.
It's more than channeling audio files to the browser's codecs, though. A SID player, for example, runs a 6502 CPU emulator and a SID chip emulator in the browser using JS. So it's problematic in such scenarios. Otherwise, I can watch YouTube videos or listen to Internet radio without issues.
> don't think that AV stuff would be tested by Speedometer
It probably isn't, but fwiw yes web audio is controlled by JavaScript. Doing it right means using web audio worklets, which is a special purpose JS context that has no access to your main page context.
I tested this with Firefox stable release and Brave stable release, 3 runs on each. Same exact extensions across both.
Highest scores across tests:
Firefox: 6.34 +- 0.31
Brave: 11.3 +- 0.37
on Ryzen 9 7940HS + RTX 3060 mobile
Which really sucks, since I highly prefer Firefox, but this past week I've been trying out Brave and I think it's noticeably faster and smoother. Even with the reduced speed I'm still swayed toward Firefox for the customization you can achieve with a userChrome.css file.
The 'same' extensions on each browser control for your experience, but not for the browsers' performance:
The browsers have the same or very similar APIs for the extensions but that is just the interface; each browser executes the extension's instructions differently (a lot or a little - I don't know the browsers' code). The same extension will impact Brave's performance differently than it will impact Firefox's. In other words, the same extension is not, in this sense, the 'same' on each browser.
In this sense, an extension is part of the user experience, like a website. The Speedometer test suite doesn't include those extensions (I assume) and that is the experience the browsers are optimized for.
The parent's test doesn't represent that; it does represent their desired experience, of course.
Even if you have the exact same extensions the fact that you have an old Firefox profile may be hindering the results. Try comparing with a fresh Firefox profile with the same extensions.
We should stop "speed-shaming" browsers and focus on websites, which negate all performance improvements made by browser developers by adding more useless features.
When that metric is "performance in real workloads", I can't imagine it ever becoming irrelevant. Just look at their new tests:
> In particular, we added new tests that simulate rendering canvas and SVG charts (React Stockcharts, Chart.js, Perf Dashboard, and Observable Plot), code editing (CodeMirror), WYSIWYG editing (TipTap), and reading news sites (Next.js and Nuxt.js).
> We’ve also improved the TodoMVC tests: updating the code to adapt to the most common versions of the most popular frameworks based on data from the HTTP Archive. The following frameworks and libraries are included: Angular, Backbone, jQuery, Lit, Preact, React, React+Redux, Svelte, and Vue; along with vanilla JavaScript implementations targeting ES5 and ES6, and a Web Components version. We also introduced more complex versions of these tests which are embedded into a bigger DOM tree with many complex CSS rules that more closely emulate the page weight and structure from popular webapps today.
Improving these benchmark results will at least partially make those libraries faster in the real world, and most likely also many additional libraries and workloads.
The article even links to real-world performance measurements that are completely separate from the benchmark. I have no idea what the purpose of the OP's comment was. I guess they just wanted to blurt out the first cynical thing that came to mind. Awesome contribution.
I love the fact that an honest question is met with such hostility. I knew there was a quote, but was unable to think of it well enough for a proper search. It is easier to find an answer from other people with just a shard of detail that search engine will not find as there's no SEO for the broken fragment.
I'm so happy to see this place is alive and well with the attitude to support a curious mind. What a tosser
I've probably spent too much time on the internet, then, because I definitely wouldn't have interpreted your original post as an honest question without the clarification in your followup comment. Probably a defense mechanism built from past pain caused by assuming good faith and then being ridiculed by the non-honest-question-asker.
But with that nastiness out of the way - I looked back at your original question and thought "seems like the kind of query Kagi would eat for breakfast". Here's what it responded to your question: https://kagi.com/search?q=What%27s+the+quote+about+optimizin...
Specifically, the "what" at the start and the "?" at the end triggered the LLM-powered quick answer at the top (and it passes the smell test for correctness).
To Google's credit, it also returns reasonable results for this question.
Probably Goodhart's law: when a measure becomes a target, it ceases to be a good measure.
To my mind, it means that when the metric becomes the main focus, it is easy to forget the original goal and even to work against it. That is not the case here.