All hardware is open if you're willing to open it up ;)
But reverse engineering isn't as easy as my snarky response implies. But I do think more of us should get into hardware hacking. It's the only way we have to fight back. I'm tired of this "own nothing" paradigm and being forced into whatever dumb thing they want us to do. And it's so dumb too. There aren't many power users, but there's a disproportionate amount of resources dedicated to fucking us over.
I wonder if these people are just avoiding thinking about the tough things in their lives.
I wonder if these people are just scared of being human, so reaching for any distraction they can get.
I've tried to stop taking my phone with me when I go to the bathroom. When I shower. When I go to bed. Because I think we all have these same addictions. There's things that suck in life. But maybe if we put our phones down we can work together to solve these things.
The author says he has two kids, which would likely constrain what he is able to accomplish in his free time.
Children are financially dependent on the parents to provide for them. There's not really much way around that. It makes sense that if you can do more things within the time that is left that people will try to figure out how to cram those things in. What we would have resigned to give up in the past now seems possible to attain with enough AI credits and tools.
I do know a lot of people who love to talk. I don't think it's a character flaw. It's certainly not what I want, and I would die if I had to talk all day, but it's just the way they prefer to communicate. Same way that some people are introverts and some are extroverts, some people like reading paper books and some people like audiobooks.
I like how in spite of the author explaining why (father of two small children that occupy his free time), you jumped to the most negative set of possibilities. Instead, it sounds like when he's with his children, he is focusing on them instead of on productivity, which is the opposite of what you're suggesting.
Also, if he instead chose to occupy his drive time with listening to a comedy podcast, or NPR, or even a technical podcast, I can't help but imagine you wouldn't give it a second thought, in spite of that being just as "productive" and "avoiding thinking about the tough things".
Surprisingly difficult to do. The assumption is that there are some things we do in our life that act as a blank space that must be filled with something. Productivity, or deep thought, or whatever. People go through life always doing something so there's never a "wasted" moment, but I'd argue that's a recipe for burnout and unhappiness.
There's a Buddhist concept of suchness: seeing things exactly as they are in the present without judging them or trying to change them. Doing anything else but "just driving" is trying to live somewhere other than where you actually are. Wherever you are, and whatever you are doing right now, is what life is. Your life isn't somewhere else in the future, and you don't need to escape from a mundane task and rush somewhere else to experience life. All of it is life, even the boring parts.
I have not-very-well-treated AuDHD, and being alone with my thoughts for very long is generally not very productive. At least coding LLMs have helped me bring to life things I wouldn't have had the attention span to make in the past. And a good bit of vibe coding is just yelling at the LLM that what it's doing sounds good on paper so keep going, please do the needful, make no mistakes :V
I think people miss that this is a possibility. Yet it's fairly easy to see that when more people become homeless more bicycles get stolen off the street. Idk about you, but I'd have to be pretty desperate to steal a bicycle. Idk about you, but if I was living on the street struggling to find food, I'd be pretty desperate. I mean FFS you don't even have a stove to cook ramen. And where are you going to get the money to afford a camping stove?
Yes, and there are sometimes many layers to it, which is why you can think "cool, I get that" while still missing something important that would be obvious to an expert.
> IQ is about aptitude and credentials on specific topics are about knowledge and skills.
Meaning it can be learned. Trained.
I'm not defending the metric. People use it like it is some innate thing that doesn't change over one's lifetime. In fact, a college education is a great way to increase your IQ.
It's also important to note that IQ is normalized. An IQ of 100 today is different than an IQ of 100 20 years ago. Notably, scores have been increasing over time (the Flynn effect), so someone scoring 100 on a test in the year 2000 would have scored roughly 115 against 1950 norms, at the commonly cited rate of about 3 points per decade. That's an incredibly important piece of information needed to even do basic comparisons of IQs.
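To make the renormalization point concrete, here's a minimal sketch of the arithmetic, assuming the drift in norms is linear over time. The function name and the drift rates are illustrative assumptions, not figures from any particular test; real drift varies by test, country, and era.

```python
def rescore(iq, test_year, norm_year, points_per_decade=3.0):
    """Re-express a score normed in test_year against norm_year's norms.

    Assumes a constant linear drift in norms (hypothetical simplification).
    A positive drift means older norms are easier, so the same performance
    maps to a higher score against them.
    """
    decades = (test_year - norm_year) / 10
    return iq + points_per_decade * decades


# Same raw performance, scored against 1950 norms, under a few assumed rates.
for rate in (2.0, 3.0, 4.0):
    print(f"{rate} pts/decade -> {rescore(100, 2000, 1950, rate)}")
```

The takeaway is just that two IQ numbers from different decades aren't comparable until you pick a drift rate, and the answer moves a lot depending on which one you pick.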
> Meaning it can be learned. Trained. […] In fact, a college education is a great way to increase your IQ.
You make this argument on the assumption that the effect is causal. But in reality one cannot distinguish whether education raises IQ or whether people with higher IQs stay longer in college.
Whether things like "intelligence", "cognitive ability", and "aptitude" (some of which may be synonyms depending on your view) are innate vs. learned or fixed vs. variable over time are orthogonal to each other. And for each of those pairs, the answer may not be as simple as a binary division or even a gradient (it may decompose into something weirder, being causally determined by multiple factors where some of those factors are fixed and others aren't).
Moreover, both of those questions are separate from questions that get at what IQ measures (does it measure aptitude, does it measure factual knowledge, does it measure social knowledge or acculturation within a specific context, etc.).
Lots of things are easy to identify as both substantially genetically determined and variable over time and mediated by environmental factors, e.g., height. Lots of things are likewise easy to identify as significantly environmentally determined but also largely stable over time if not altogether fixed (e.g., personality, attachment styles).
It's also at least possible for all of the following to be true at the same time:
- IQ tests correlate with socioeconomic status
- IQ test scores vary over time and can be increased
- some IQ score increases, or some part of a given IQ score increase, reflects a genuine aptitude increase
- IQ tests are somewhat gameable in that training for IQ tests can increase scores so that some of the measured increase does not measure improved cognitive ability
where aptitude means something like fluid problem-solving ability, speed of learning, etc.
I think this is a more important comment than people might take it for.
We all want meritocracy. Really. But the problem is that meritocracies are never really meritocratic. The problem is that it's actually really hard to measure these things. It looks simple at first glance, but once you dive into things it starts to change.
Let's change your example above and ignore cheating. Let's say there's no cheating. The rich and well off still tend to have the advantage. Let's even pretend that a rich person and a poor person go to the same school, in the same class. It's more likely that the rich person will get extra tutoring for those exams. The more important those exams are, the more valuable those tutors become (allowing them to charge more and more).
Are there not test taking strategies? The mere existence of this should tell you that the test is measuring something more than knowledge.
I'm just using this as a simple example but I'd encourage others to think more deeply about it because these things do matter if we're going to try to make a meritocracy. I'm not saying we shouldn't try, but I'm saying one of the most critical parts to creating a meritocracy is recognizing the limitations in the metrics. It's an alignment problem and Goodhart always comes back to bite you. As soon as you become complacent you drift further from meritocracy.
Meritocracy will always be a dream. We should chase our dreams, but we need to recognize the difference between dreams and reality. You'll never make those dreams come true if you can't tell the two apart.
The danger of a meritocracy is in the word. What do you merit? Your job? Fair enough. More rights? Certainly not. I'm afraid it's easy for some to start viewing others as lesser because they don't merit one's position, consequently one's status and thus should not have a seat at the important tables because after all they don't "merit" it.
What I want ultimately is that we strive to give a better life to everyone. And I don't think that's what meritocracy achieves.
> I'm afraid it's easy for some to start viewing others as lesser
We already do this and we've done it throughout history too. There's always some excuse people will make to feel better than others. Wealth, religion, race, intelligence, education, all sorts of things.
But we do want high social mobility. If you work hard it is easy to climb the ladders. If you squander your wealth it is easy to slip. I'm not saying there should be no friction, the correct balance is always hard to find.
But whatever that merit is is something we need to decide as a society. It can be anything we want. It can be your work that contributes to monetary growth. It could be work that contributes to scientific growth. It could be how great of an artist you are. How popular you are. Or anything. We decide and we decide how much one means more than the other. Or we could even decide that there are no "lessers" and we could decide that the person traveling the world on their parents' dime has the same value to our society as a scientist, businessman, or artist. Mind you, I'm not talking about their value as a human, that's different.
Well, meritocracy isn't just who gets the jobs. It's who gets the jobs that run society. And that's important for everyone, because it matters to everyone that society be competently run, rather than run by incompetents who have important parents.
Standardized testing so far is the worst solution, except for all the others.
Sure, wealthy people can pay for standardized testing prep. However, test prep is a much lower barrier than having to pay for exotic experiences abroad to pad admissions essays or connections to gain political exposure so you know the appropriate shibboleths to utter or racial features to highlight.
My point implies that you have to be dynamic. If your evaluation methods are static for too long they get hacked. You have to balance that though. Change too often and your system is overly expensive and cumbersome. Change too infrequently and the cheaters end up at the top.
You're right that it can always be worse but you're wrong to say that we can't do better now
I think people say they want a meritocracy, but they actually mean "everyone can succeed", which is a different thing. In a meritocracy where everyone is trying hard (like in Asian cultures), hard work is not enough, and not everyone can succeed. In America, there is some slack so hard workers can succeed with below average genetics (which is why, practically, meritocracy="everyone can succeed") but I think things are changing as competition is increasing.
This is just a thought bubble, but it makes a certain amount of sense as to why the current administration is so dead-against DEI initiatives. Whilst they say it's about merit (and we all know it's not - just look at almost all of the appointees), it's actually about the added barriers in the way of assigning the individual (not the 'type of person') that they would choose for the role, based on their personal network of contacts and / or those who have made "charitable contributions" (which probably brings them into the fold of personal network contacts anyway).
DEI quotas make it hard to bring along a whole team of boot lickers.
Or in the case of this administration, window lickers.
I find this study quite suspect. I'd have to dive deeper but there's definitely significant alarm bells that should be going off for anyone reading.
Figure 2 (page 6) screams problems. There's only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol
There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?
I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over
Independent of whether it has any meaning (because the entire paper might be a bit iffy), I find it curious that Instructors 3 and 8 have the lowest harmfulness rates, quite a bit lower than even the LLMs, but not the highest preference rates. Harmfulness anticorrelates with preference, but not perfectly. Some amount of charisma appears to be a factor even in selections by professionals?
The paper says the professors have a median of 200 comparisons each. It also says they only used 2 models because using more models would require more comparisons and they selected Google models because Google was branded/advertised as being education focused. When you see other models show up elsewhere, that's because they extended the main idea to other models but using LLMs to judge instead of human professors.
Sure, but the biggest problem is they have no statistical significance. Variance is too high. How do you distinguish the signal from the noise? Confidence intervals aren't enough.
But is it a surprise law professors aren't great statisticians?
I disagree. 16 isn't necessarily the relevant N here but the number of responses is.
If you have 100 responses from 1 professor, and the AI wins 75% of the time that is very likely a true signal that the AI is better than this prof. It would be incorrect to generalize this to all profs though.
Further, if you sample 16 profs and the AI beats 10 of them you can be fairly certain that the real percentage of profs it beats isn't 10%. Further, when estimating the probability that the AI beats a random prof, it's the relative estimation error that scales with 1/sqrt N. If you have a coin and it lands heads up 16 times, that tells you something quite robust about the coin.
Reasonably estimating confidence intervals at small N and high p is not trivial. But it can be done.
A good heuristic is "add 2 successes and 2 failures", which is due to Agresti & Coull.
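For anyone who wants to try the heuristic, here's a minimal sketch of the "add 2 successes and 2 failures" interval in plain Python. The function name is mine, and the example counts below are the hypothetical ones from this comment (75 of 100, 10 of 16, 16 of 16), not numbers from the paper.

```python
import math


def plus_four_interval(successes, n, z=1.96):
    """Approximate 95% interval for a binomial proportion.

    Implements the 'add 2 successes and 2 failures' adjustment:
    pretend there were n + 4 trials with successes + 2 wins, then
    apply the usual normal-approximation interval to that.
    """
    n_adj = n + 4
    p_adj = (successes + 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to [0, 1] since a proportion cannot leave that range.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical counts from the discussion above.
for wins, trials in [(75, 100), (10, 16), (16, 16)]:
    low, high = plus_four_interval(wins, trials)
    print(f"{wins}/{trials}: [{low:.3f}, {high:.3f}]")
```

With these inputs the 75/100 interval stays above 0.5 and the 10/16 interval stays well above 0.10, which is the point being made: small N gives wide intervals, but not meaningless ones.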
I think it is more likely that they selected Gemini because the lead author is a fellow at an institute which receives a lot of its funding from Google.
Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.” In another two years it’s going to be curtains.
The issue is, it almost always outperforms knowledge workers.
IF the right questions are asked, and IF steered into and corrected at a few crucial points. IF not it goes off in the wrong direction really quick and that's a problem that's still mostly unsolved in the last 2 years.
And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.
I help run a few marketing websites where I let the CEOs run crazy with Claude cowork. They are making PRs like madmen, but they are not allowed to touch any of the APIs & platforms where there is real user data & sensitive information.
Ya, while the tools are really solid and have seen huge leaps these past two years, in no way will an LLM be able to do any of it unguided in two years. Just a humble opinion that I would love to see be wrong.
"in no way will an LLM be able to do any of it unguided in two years"
IDK "not any of it" seems a bit strong, especially thinking towards 2028. For a lot of knowledge professions, there is a surprising amount of tasks that are just dumb work compared to the rest.
There's a huge difference between one shot and few shot versus building a robust harness with deterministic and adversarial quality gates. And I'm finding that agents can actually do a pretty good job of a surprising number of things if you are very clear about your dimensions of quality and the rubrics that you get agents to research and then use to validate against those dimensions of quality.
Make sure to use a deterministic pipeline or harness to go step by step so agents aren't checking their own work. I sometimes get alpha from having Codex check the work of Claude, and I am seeing pretty good output across multiple domains when I have three independent quality gates and a loop which only spits the work out to a human if it doesn't converge at a reasonable cost.
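Roughly, the shape of that loop can be sketched as below. This is a minimal sketch under my own assumptions, not the commenter's actual harness: every name here is hypothetical, the gates are toy deterministic checks, and `toy_revise` stands in for whatever agent call does the revising. The budget is expressed as a round limit rather than a real cost.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A gate returns None on pass, or a failure message (hypothetical convention).
Gate = Callable[[str], Optional[str]]


@dataclass
class Result:
    output: str
    converged: bool  # False means: hand it to a human
    rounds: int


def run_with_gates(draft: str,
                   revise: Callable[[str, List[str]], str],
                   gates: List[Gate],
                   max_rounds: int = 3) -> Result:
    """Check the output against every gate; revise until all pass or budget runs out."""
    output = draft
    for round_no in range(1, max_rounds + 1):
        failures = [msg for gate in gates if (msg := gate(output)) is not None]
        if not failures:
            return Result(output, True, round_no)
        if round_no < max_rounds:
            # The reviser only sees gate feedback; it never grades itself.
            output = revise(output, failures)
    return Result(output, False, max_rounds)


# Three independent toy gates standing in for real quality checks.
def no_todo(text: str) -> Optional[str]:
    return "contains TODO" if "TODO" in text else None


def ends_with_period(text: str) -> Optional[str]:
    return None if text.endswith(".") else "missing final period"


def short_enough(text: str) -> Optional[str]:
    return None if len(text) <= 80 else "too long"


def toy_revise(text: str, failures: List[str]) -> str:
    """Deterministic stand-in for an agent revision step."""
    text = text.replace("TODO", "").strip()
    return text if text.endswith(".") else text + "."


result = run_with_gates("Draft TODO", toy_revise, [no_todo, ends_with_period, short_enough])
print(result)
```

The design choice that matters is that the checking is done by code or by a different model than the one that produced the draft, and that non-convergence is an explicit outcome rather than something the loop papers over.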
> Just a humble opinion that I would love to see be wrong
Out of curiosity, why would you love to be wrong about that? What possible outcome could you see being a net positive for society if the vast majority of knowledge workers (and ultimately, as robotics progress, most workers in general) are replaced by AI?
I believe it was Blink-182 who said, "Work sucks". You have to pay people to do that stuff; they don't want to be there. And then you get into second order effects- costs plummet for anything labor intensive, including medical care, prepared food, cleaning, and private tutors. Then onto tertiary effects- if you can spin up a million genius researchers to attack a problem, you start seeing massive progress in every important area and it isn't tied to population growth.
I get that you might have a 'UBI/alternative general welfare is impossible' up your sleeve, but you've written this like it's somehow unfathomable that not forcing everybody to work just to survive would be a good thing. Of course it would be good! It's just a matter of dealing with the (huge) side effect of lost income.
In that scenario, AI would have to be a public utility, which it is not. Private corporations have no intention to provide services for public good. If they displace a billion jobs, they'll just throw up their hands and go "we're just an Ai company guyz"
> I believe it was Blink-182 who said, "Work sucks". You have to pay people to do that stuff; they don't want to be there.
Believe it or not, some people actually do enjoy their jobs and work they do.
> I get that you might have a 'UBI/alternative general welfare is impossible' up your sleeve, but you've written this like it's somehow unfathomable that not forcing everybody to work just to survive would be a good thing.
UBI absolutely is unfathomable here (US). The USG won't even give people health care. People go bankrupt to afford life saving care on a regular basis. Or just die... Even if those cases are a minority, just the fact that it happens says a lot. So I do think it is unfathomable that UBI would be implemented here. I don't think that's unreasonable to say.
I think a lot of the time when this debate occurs (which, at this point, literally every single day I see something about this) UBI is almost always the contention point but I feel like that's really not the end-all... Like, sure, say there's a miracle and we have UBI get instituted. I think that is maybe 25% of the solution. The other problem is now you're going to have an entire class of people basically living without a purpose. Yeah, I get it, they can go and "explore their passions" and focus on "creative works" or whatever BS people persuade themselves into thinking the vast majority of society would want to do, but realistically I think there would be a huge psychological breakdown in people now living without a fundamental purpose in society.
Somewhat related to that -- I was just this weekend watching a YouTube essay about PTSD in knights back in medieval times, and the main point made in the video is that the psychological impacts incurred by the knights after battle were not just from seeing fucked up shit... the most apparent and serious cases of "PTSD" occurred when a knight was injured badly enough on the battlefield that they were no longer able to be a soldier. Their entire purpose in the world got stripped away, resulting in serious psychological stress. I think that same issue would apply to many people today (lawyers, engineers, investment bankers, etc) who would no longer be able to practice their craft. (This is the video for reference, was a good watch https://www.youtube.com/watch?v=849dmdc-Qf8)
I understand the counter argument to this is going to be some anti-capitalist rhetoric like "Well people shouldn't live to be workers and that's fucked up that they have to live that way!" but IMO, some people like what they do and don't want to be made useless. (Not implying that is what you were insinuating, but just in a broad sense that genre of argument doesn't make sense to me.)
In a way, we are betraying something here. My reading is: solving the social problems of capitalism feels so impossible, that reducing the need for anyone to do work is a liability. In a way this sentiment should make extremists of us all?
Unfortunately the fact is that society has some massive imbalances around capitalism
It is not hard for me to imagine a world where, if my bosses didn't need me, they would prefer me to be dead than to pay me some kind of permanent income. They would prefer to keep that power to themselves.
These are already the sort of people who will happily lay you off into a recession, leave you without a way to pay your rent or for food if it improves their bottom line. They do not care if you starve. Or at least they care less than they do about their quarterly bonus
So no, I don't trust these fucks to continue playing nice if they view my value as going to zero
Yeah, it can do things unguided if the tests to confirm its correctness are very solid. That's where a lot of progress has been made and where agents are good, but this is domain specific, and an opening where startups can shine.
> And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.
Which also happens with humans – does it do so at a lower rate? On its own, it kind of sounds like similar anti-self-driving-car arguments.
Yeah, that's why I mentioned it works well IF guided by the correct expert.
I agree that you can create a set of domain-specific rules and reinforcement-layer validation tools, like in self-driving, that vastly improve the accuracy of AI & LLMs, making humans less and less needed. But where the magic of LLMs comes from generic knowledge, this will be the opposite: narrowing it down.
I kinda disagree. High-risk environments just mean that they will have to keep a human in the loop for a longer time, which drastically reduces the skill required of that human (it still requires high skill, just not stupidly high).
The employers will think it requires less skill, whereas in fact it might actually require more skill to do a good job of being the human-in-the-loop.
For example, my sister is a translator and she says that checking AI translations is actually harder in many ways than doing a translation in the first place, but the agencies pay less for checking than actual translation.
I used to do audio transcription and some video captioning. Found it a bit drudgerous and fatiguing in rather specific ways, but I was effective at it and could find some satisfaction in it. It's been some years now, so I haven't had a chance to try out the kind of thing they're doing now, but I'm pretty sure I wouldn't want to. I can raise my blood pressure just sitting here and thinking about what it would be like to have to go through a Word doc and correct the bot's errors. But, even putting aside my professional pride (or indignation), I can only imagine that it would make all kinds of mistakes I never would, and wouldn't be any help with the parts I'd have trouble with. And I'm pretty sure that, at least often enough for it to be an issue, the priming of reading what the bot thought something was could easily make it way harder to hear it correctly, if I notice there's something wrong in the first place. I assume there's a similar problem for your sister along the lines of throwing off how it would occur to her to express something in the target language.
Doesn't it increase the skill required? You need to be able to jump in at the perfect time, while waiting patiently for 99% of the time. It's like self-driving that requires you to "jump in" at the worst possible time (0.5 seconds from a crash), and stay put the rest of the time--but don't get bored or inattentive. The only way to do that would be to be so naturally good at the danger point that you can do it basically reflexively.
I think the opposite, only the most skilled will be required.
But it depends on the skill:
- For landing pages & simple saas solutions: marketeers & founders have more skill, since they understand the user best. The real skill is not the basic coding, but understanding the market.
- For security risks/architecture: senior devs can spot things in seconds
I'm not a doctor or lawyer, but I'm sure there are cases where AI is really good in a similar way and cases where it misses the most crucial aspects.
> drastically reduce the skill required for such human
I mean, that's what is wanted by some companies.
The problem, especially for things like legal, is that it requires someone more skilled to read through and understand whether the argument is bollocks, or whether the law/precedent they are banking on is in fact the right one.
We have a tool that auto-writes letters to our management companies when they break SLAs. We have a slider that goes from polite to we are going to extract your first born.
That's simple-ish to do for LLMs, and low risk.
Drafting contracts is also something we could probably do, as it's mostly boilerplate. However, the consequence of mis-drafting a contract can run to millions of dollars.
Man, this comment made me think of a Kafkaesque future where two AI lawyers and an AI Judge are stuck in an infinite loop arguing over a case, meanwhile the defendant is running around trying to get anyone in the legal system to recognize that the AI is stuck.
If the human involved has no skill then they might as well not be there, since they're just a fall guy when things go wrong and won't do anything to prevent it from happening.
Yeah but even what you describe makes it an extremely useful tool and productivity boost. Sure, we're not going to deploy a lawyer agent with full autonomy and no more oversight than a real lawyer. But isn't it wild that's now the frontier?
It's not like self driving cars where better than a human 80% of the time isn't good enough and they aren't really usable until it's 95%, 99% etc.
> the study where it beats our highest caliber of knowledge workers may have some methodological deficits
The point is that if the study can't validate the claims being made then we can't actually extrapolate from that claim. What you're predicting may or may not come true, but the study (which is the topic at hand) isn't useful for supporting the assertion.
> Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.”
Assuming it keeps improving at the same rate, which I think we are already seeing not play out. If you compare the first six months when GPT truly hit the mainstream to the previous six months, the improvements are not nearly as evident. That isn’t to say they aren’t noticeable, I could definitely tell it’s improving, but not nearly at the pace it once was.
There’s also the fact that they can’t possibly keep improving frontier models at the same rate (i.e. training investment) when investment starts slowing down. The amount of cash being burned is completely unsustainable and you’re already seeing some pullback.
On the other hand, we keep seeing only marginal generational improvements in the CPU space, yet performance gains over the last 10 years in CPUs are very material.
Every new model might not be a leap like it used to be, but give it enough time and improvements add up.
Nobody is disputing that. I specifically said that I can see the improvements from the last six months. What I’m saying is we can’t assume that every two years it will improve at the same rate.
The further we get into this, the more AI feels like 3-D printing. Significantly bigger and will be more widely used for sure. But nowhere near the “new industrial revolution” that all these companies are making it out to be
Do you agree that the economic and behavioural shift will be comparable to mobile, and that we are in the Nokia 3310 era? Does that count as an industrial revolution?
I think that’s kind of a strange question/parallel that doesn’t have a concrete answer, partially because even the people making these tools don’t really know where it’s going to land or what the ultimate utility is. Hence why they’re begging all of us to figure out the billion dollar applications for them.
Ultimately they are clearly here to stay but I think they are going to be incredibly important in some industries and minimally present in others (a glorified chatbot/summarizing tool for instance). Whatever form it takes it’s definitely not going to be a model where individuals have subscriptions they pay for monthly.
> even the people making these tools don’t really know where it’s going to land
Exactly my point in comparing it with the pre-iPhone mobile market: wide (and growing fast!) adoption, clear potential (WAP websites, J2ME games), many players in the game, some real market fit discovered already (Blackberry), influx of capital and tinkerers alike, but still a lot of unknowns about where it will ultimately land.
Even if no single improvement was revolutionary (even first iPhone was just a fancy phone without App Store), overall mobile made billion dollar industries possible, for better or worse, and changed the way we live. Counts as industrial revolution, comparable to the Internet itself in my eyes.
Everyone has one at home spitting out items they need daily/weekly like was promised. I don't know if you remember the 3D printer (somewhat) boom of the 2010s but the hype was crazy when it became more mainstream. Maker spaces popping up in cities everywhere, schools showing off their units, every conference had some talk on them, startups left and right. The AI boom is basically a more-funded version of that. It was hot hot hot and people thought every home was bound to have one.
The difference with AI is it affects all technology at the same time. 3D printing only affected manufacturing. What we're seeing now has impact in chemistry, medicine, software, and all other knowledge industries at the same time.
It took a while, especially because the early 3D printers were a project of calibration unto themselves, but modern printers are fairly trouble free. I accidentally melted the bottom of my blender jug on my toaster oven so I'm printing a replacement one right now. Turns out the critical mass needed is someone else having already done the CAD so I can just hit print from my phone, which makes 3D printing a reality.
The issue is that before GPT, models were basically useless for any conversation. We are literally in the science fiction realm. From a text conversation perspective, the gap between where we are and where we still need to get to is relatively small.
In my opinion, the main thing we need to do is have training happen continuously. And probably more real world data (from sensors).
Not necessarily. In many (most?) areas of tech the rate of advancement follows a logarithmic curve. That is to say, the first 90% is achieved quickly but the last 10% takes significantly more time.
The ELIZA effect has been around since 1966. I think lots of folks feel “AI” has advanced much more quickly than it really has because of the nature of its many past boom / bust cycles.
ELIZA has never done well in conversational tests. GPT-4.5, for example, tends to out-human humans. Like I could never ask ELIZA this question and get anything close to a decent response: "Give me three points that convey the impact that 9/11 had on rap music in the 21st century with some good examples?" Asking ChatGPT today gives me an answer that I'd give an A to if it came from a strong college student. ELIZA's response -- "What do you think?".
This is the hot button right here. Most of the advancements have also come at the cost of excess: exponential token use at the expense of marginal gains.
Context is still a large limiting factor, and we have band aids around that area already. And the further along we go the further distributed LLMs get in terms of additional pieces.
As for the original article and sentiment I'm sure AI will be a boon for law. It's going to be much easier for the general consumer / person / small business to represent themselves which feels like a win. The downside is I feel like we're tracking towards a digital hell of "virtual lawyers" that will be at the whim of any org. Consumer laws really need to change now to help avoid this dystopian path we're on.
I agree.
But notice that you assume that there is a metric with which you can measure improvement.
Which is fine if you are measuring against your personal taste.
But it might be that the optimization target itself has a ceiling. If you're training toward human approval ratings from a broad population, you converge toward what median preference selects for. The plateau is baked into what you're measuring against.
It doesn't even need to 'improve' at the same rate to have extraordinary impact in society. Even if the frontier models stayed roughly the same in cost and capability for just 1-2 years, the harnesses and processes built around them would mature. We have not yet metabolized these models. Frankly, a lot of this feels like late 80s early 90s complaints about how office computerization wasn't happening yet--it was, just not at the rate promised by the companies selling computers to businesses. We don't look back at those people in the 80s saying that paper was here to stay as visionaries just because they noticed that propaganda temporarily outran the business environment.
I just wish people would take a step back and think about the timescales here. Language Models are Unsupervised Multitask Learners was in 2019. Here we are seven years later and LOOK AROUND. The landscape is unrecognizable. It's worth thinking about who, in those seven years, had an accurate estimate of the future and whose estimate fundamentally failed. And just as it is valuable to note where propaganda about progress speeds past where we are, we should remember that it is costless to announce that at some unspecified future time all of this will settle down and things will go back to the way they were.
> I just wish people would take a step back and think about the timescales here. Language Models are Unsupervised Multitask Learners was in 2019. Here we are seven years later and LOOK AROUND. The landscape is unrecognizable. It's worth thinking about who, in those seven years, had an accurate estimate of the future and whose estimate fundamentally failed. And just as it is valuable to note where propaganda about progress speeds past where we are, we should remember that it is costless to announce that at some unspecified future time all of this will settle down and things will go back to the way they were.
People can understand all this and still disagree with you.
What if the methodological deficits are actually causing the paper to underestimate the quality of the AI responses? Why assume any deficits would bias the AI's competence upwards instead of downwards?
More than that, the entire structure of the study is pointless. They set it up as a question/response and then had humans rate the response. That's literally what LLMs are trained to do, which ultimately is convincing a human to click the "I like this one better" button on its response.
LLMs are trained to convince a typical human to click the "I like this one better" on their response.
Convincing a human law professor to click the "I would prefer to deliver this response to a student" button, and to not click the "this response is pedagogically harmful" button is a different task!
I could imagine an LLM convincing a typical human to click the "I like this one better" button with flattery, or with nice-sounding platitudes, or with hand-wavey explanations that sound plausible. And in fact that's exactly what LLMs do when they go wrong - they bluff and output superficially plausible nonsense!
But these weren't typical humans, these were law professors specifically tasked with deciding which response was a better option to give to students as a canonical answer to a contract law question. So I think this is a genuinely impressive result.
IDRC if the LLMs "understand" anything. They are being used here to produce outputs that are desirable. (Neglecting the real possibility that this "survey" is complete BS, as noted elsewhere.)
This is kind of like saying you can't compare Computer Vision models to Human performance because those models were literally trained to identify objects in images...
I'm not saying you can't compare them, I'm saying it's pointless. LLMs are extremely large scale multivariate regression machines; evaluating their output within their own training domain is as pointless as seeing if a ball rolls downhill.
I think your 3k figure comes from here - It is explained:
> As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student
The study was conducted by Stanford’s HAI institute, which receives heavy funding from Google (how much I couldn’t find, because they don’t publish their donations in a place I could find it, but I suspect it is a lot). And the authors did not declare a non-conflict of interest at the end of the paper.
Wait, where are you seeing the link to HAI? TFA mentions something called "liftlab" which seems to be something under Stanford Law School and separate from HAI. The study has more than a dozen authors from as many different universities but HAI is not mentioned.
The leader of the study, Julian Nyarko, is Associate Director and Senior Fellow at HAI. I can't say whether that means the study was conducted by HAI, but there is at least a connection to it. https://hai.stanford.edu/people/julian-nyarko
You are right, this study was technically conducted by The Stanford Law AI Initiative which is co-chaired by Julian Nyarko who is also a senior fellow at HAI, and is also the lead author of this study.
This is enough of an association to claim a conflict of interest between the study authors and Google. But I wanted to go further and see if The Stanford Law AI Initiative had been given a research grant from HAI. So I spent way too long on both of their websites trying to find a list of research grants either awarded by HAI or received by Stanford Law AI Initiative. But no such luck. Despite HAI having a page dedicated to Centers and Labs, and to Research partners, and despite claiming 500+ research funded, they only list like 6 organizations each, and then link to each other in their “See More” button below.
I have a feeling I will have to browse through some tax filing papers to find the truth here. But I am not a journalist, so I am not gonna. I am simply gonna leave it at the obvious associations involved here. And maybe issue a correction: “conducted by a senior fellow at HAI”
When they are studying a consumer product it is pretty customary to declare a non-conflict of interest. So yes. Declare it at the end of your paper please.
Unless you have a conflict of interest, in which case declare e.g. “the lead author of this paper is an Associate Director and Senior Fellow at HAI which receives funding from Google the company which makes Gemini”.
People don't always have the resources to conduct massive "proper" studies. We live in the real world, and have to settle for what studies people can conduct.
Not saying we should take such studies as the "gospel truth" ... but if you ignore them and only consider "proper" studies, you'll be waiting a very long time to learn anything new.
You are saying the companies that are planning to build structures the size of Manhattan, while claiming multiple trillions TAM, and eventual apotheosis, along with the consumers of these models can't scrape together enough coins to fund a study with a decent statistical power?
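For a rough sense of scale, here is a minimal sketch of how sample size tightens the uncertainty on a forced-choice preference rate. The 2,918 total and the median of 200 per judge come from the quote upthread; the 60% preference share is a made-up placeholder, not the paper's result. `wilson_interval` is my own helper, and it treats every comparison as independent, which repeated ratings by the same judge are not, so the real uncertainty is probably wider.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    hypothetical_share = 0.60  # placeholder, NOT a figure from the study
    for n in (200, 2918):      # median per judge, and total comparisons
        wins = round(hypothetical_share * n)
        low, high = wilson_interval(wins, n)
        print(f"n={n}: {wins}/{n} preferred, interval {low:.3f} to {high:.3f}")
```

The pooled count gives a much narrower interval than any single judge's 200 comparisons, but the interval only speaks to sampling noise. It says nothing about the bias and funding questions raised elsewhere in the thread.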
These studies are often conducted by the AI companies themselves (in this case, an institute that receives funding from AI companies). If they were interested in the truth (which they obviously are not) and not propaganda (which they obviously are), they would fund the necessary research. AI companies have plenty of money and can well afford to do this properly.
Other than AI companies, a more realistic option is state-funded universities (particularly in Europe and East Asia), where there are consumer protection agencies whose purpose is to protect their residents from corporate greed, and which as such should fund, commission, or even conduct such studies. They also have enough money to do this properly.
If there is enough money for propaganda, there should also be enough money for the truth.
I find it entirely likely that the preference for the AI generated answers is entirely due to the confidence of its assertions. Given the numbers of evaluations each prof had to do, there’s no way they researched the answers thoroughly. But if there’s one thing we all know LLMs can do well, it’s to generate text that sounds extremely confident. And that signal is appealing in choosing which of two statements you’d give to students.
> There's also really clear bias given that the main results only feature Google models.
The main results also don’t seem to know what a “model” is, as the two “models” it refers to are “stock Gemini 2.5 Pro” and “a retrieval-augmented version of NotebookLM”.
One of which is a model, and the other of which is an interface backed by different models depending on exactly when the analysis was performed.
But does it really matter? It seems fairly obvious that AI is going to outperform professors. While the studies run, there are three more model releases that change the calculus entirely. I wonder how much we are learning with these studies about what is going on.
Doing things that are well meaning, but ineffective is not great policy. The simplest alternative to doing things that don't work is always not doing them. Better ideas are of course welcome, but not required.
I don't think that's how science/academia works. There is no such thing as a perfect study, there are always non-idealities and noise in the data. Good studies make well-justified efforts to account for these, OP is saying they don't believe this is the case here.
Regardless, your assertion that "oh well, the models will be totally different in a few months anyway, therefore any study done today is pointless" seems more than a stretch. How do you know they will be so different? How can you verify that today's studies are completely irrelevant?
I never get the same answer from any two lawyers. I hate law as a result. With developers you might get disagreements based on experience, but there's usually a strong consensus on specific things; with lawyers and courts it's all over the flipping place. I wouldn't be surprised if LLMs can "pass" on paper (i.e. college exams) but in practice, they might 'struggle' in different courts.
...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...
This is all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.
I now foresee a future where law firms have models trained on all the transcriptions of individual judges, lawyers and prosecutors, and run agents against them to decide on the optimal strategy for a case.
Agree, though I've also heard from a lawyer to be very careful trusting an LLM for legal advice, and I believe them because the law is insanely nuanced (they disagree with me on this). Just talk to a room of lawyers about what should be "simple" clean-cut legal issues, and they might ALL disagree based on nuanced reasons and personal experiences with cases.
> I think this is probably true for most skilled professions.
I agree, BUT I also find that it's easy for experts to atrophy quickly. When the AI is right 80/90% of the time it lulls you into overconfidence.
I find those that are best and make the greatest use are the ones who remain skeptical but also use the tool. The same people who were already nuanced and picky before AI. The same people who already doubted and questioned their own work, and used that suspicion to help prevent them from having overconfidence in their own work. If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
(To be clear, I'm not saying perfectionists. Some might call them that because the picky people have higher standards, but a good expert has to also understand that perfection doesn't exist. That's often a driving force in the suspicion! This also tends to cause them to continually improve)
I would agree with this point and as I explained in a comment replying to the GP comment above, that atrophy is far more dangerous in the legal field than it is with code because legal documents do not benefit from the structural safeguards available for code, like automated testing, static typing, static analysis tools, etc. IME with legal LLMs so far, they are easily in that most dangerous valley where they can lull you into a false sense of security while still introducing extremely dangerous mistakes that are frequently difficult to detect without very careful reading.
The danger of those mistakes creeping in also grows exponentially the farther a lawyer strays from their core legal expertise. There are a few statutes I know inside and out, and I can spot LLM analytical errors related to them in a split second, but once I venture out into domains where I am not an expert (but where I am nevertheless reasonably qualified to practice), it becomes much harder to spot drafting mistakes because I have not refreshed my own understanding of the law by reviewing the relevant cases or statutes as I would when drafting the analysis myself from scratch.
> I agree, BUT I also find that it's easy for experts to atrophy quickly. When the AI is right 80/90% of the time it lulls you into over confidence
Thinking the AI is right 80/90% of the time is already a sign of being lulled into overconfidence. The actual percentage is much lower in my experience. I'm willing to grant the AI is "somewhat right" that often but is that really what we settle for?
Am I secretly the only person who ever actually cared about being very accurate? Is AI just an excuse everyone else is using so they can stop pretending? This is so incredibly frustrating.
> If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
If you are willing to do that with your own code you should probably not be trusted to work on software