Hacker News

I like how everyone laughed when OpenAI said their models will have "PhD-Level Intelligence" and now the goalpost has been moved to if AI can create new math (i.e., not PhD-Level, but Leibniz/Euler/Galois level.)


As a mathematician, new, conceptual math is when I'll become interested in reading LLM output.

I appreciate very much the work done so far, but this sort of asymptotic/quantitative result didn't interest me much even when it was done by humans.

(This is not snobbery, just a personal preference.)


I have no idea about research in mathematics: How will mathematicians judge what constitutes new conceptual math that is actually useful, vs a hallucination that might be novel but doesn't introduce anything actually meaningful?


It’s basically up to the domain experts. What I found interesting in mathematical optimization/combinatorics (my fields of interest) when an AI proved some major results some time ago was probably dismissed as a boring fact by someone else. What OP is mentioning is just their personal preference and doesn’t reflect the actual opinion of the mathematical world.


> What OP is mentioning is just their personal preference and doesn’t reflect the actual opinion of the mathematical world.

Indeed, I never claim that my idiosyncrasies represent math at large.


Same as when humans do it.

Human mathematicians frequently introduce new pointless abstractions just to churn out papers. And they are not accepted in serious journals, but they sometimes find a place in some mediocre or bad journal.

Of course, AI will increase this phenomenon manifold.


Well that's coming.

As a matter of fact, the more logic and structure there is to your work, the easier it is for AI to conquer it. Because of this, programming was the first thing that got solved, but the pure sciences are next.

If what you do, and how you do it, can be written down on a piece of paper, then AI can do it.

I do believe programming getting solved will be a double assault on these fields.

>>This is not snobbery

This is good for the species. What sense does it make to keep treating these fields as if they were reserved for the most intelligent tiny fraction of humans? Getting LLMs to do these things gives some scale to these subjects, and that's good.


> Well that's coming.

So is AGI, but we may be hundreds of years off still.


No, it is not Leibniz/Euler/Galois. More like writing good papers that contribute to the broader understanding of a theory. I think if one evaluated a mathematician's research output and it consisted mostly of the kinds of problems AI has solved so far, it would give the impression that this person is somehow very good at picking accessible problems to target, but has not made a larger impact on the field.


The goalposts are Euler level not the current model capabilities.


Yes what I was saying is what I believe about the goalposts.


What were you imagining when OpenAI said that their models would have "PhD-Level Intelligence"? Were you imagining that there were specific tasks they could do that were on par with what a human with a PhD could do? Because by that definition, many computer tools have "PhD-Level Intelligence". By that definition, Wolfram Alpha has "PhD-Level Intelligence".

What I assumed they were saying is that their LLMs would be as intelligent as a human with a PhD across all, or at least most, knowledge tasks, and they clearly are not.


My only complaint is the claims always start spreading 6-12 months before the delivery. A little patience goes a long way in what's possible with AI and we all just have to wait and see what parts actually grow this next cycle or not. Guessing at it based on trend lines only leads to people getting excited when it matches their particular guess and ignoring it when it doesn't.


>> OpenAI said their models will have "PhD-Level Intelligence"

> My only complaint is the claims always start spreading 6-12 months before the delivery.

If delivering on such promises "always" occurs 6-12 months after the promise, is that pretty good?


Again, the promise isn't _always_ delivered, people just more often focus on when the particular result aligns with their view. When it is though, it's all too commonly 6-12 months later. Which is nice but a bit annoying - why not wait 6-12 months and claim when you can actually show it? Or just say that's where it might be soon instead of talking like it is now.

I generally like AI and use it plenty often, it does many things well and I'm curious to see how far it keeps going, but that doesn't mean I have to like overhyped marketing about it.


They said that a specific version of their model had this ability one year ago.


I can't open all of the links in the article because of some Cloudflare issue (perhaps related to me being on a plane), but is the version and sub-iteration of the model they actually used for this the same as the one they announced the capability on a year ago? If so, did they comment on why they didn't just show this a year ago (they seem to have been publishing successively better results slowly instead)?


In my memory, it was OpenAI o1. OpenAI o1 was released in September 2024. It seems the difference came from the difference between benchmark performance and propagation.


PhDs used to mean publishing a novel mathematical result: when has that changed?


My good sir or madam, disproving a decades-old conjecture produced by Erdos that has had armies of people in that field have their go at it IS a novel mathematical result.


They mean "new math" in the sense of more than a novel mathematical result, a new math paradigm or so.


That's coming too.

Sometimes going some distance with a subject generates data for new ideas.

Once math gets done fast, newer ideas and paradigms also arrive.


The gap between a novel result and "new math" is as wide as the Pacific Ocean.


So they have finally reached part of PhD level. The current version relies on humans to integrate the results from the model and write the paper. If LLMs/AIs can do all of the above, then we will truly have a PhD-level model.


Not denying that these advances are impressive, but it is important to consider that this is a cherry-picked result. This doesn’t mean that AI can now be expected to do problems of similar or lower difficulty, but that it happened to work well on one problem. What you won’t see is how many others they had to try to get this result.


Earlier versions of their systems have solved other Erdos problems that people had worked on. This one was more monumental and had seen a lot more prior effort that didn't solve it, but it isn't a one-off.


This is true, but I still think the relevant question is, how many did they try before they found one that yielded to LLMs? The conclusion is very different if they tried 100 open problems and succeeded at one.
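
To put rough numbers on that, here is a minimal sketch, assuming each attempt is independent and has the same chance of success. The 1% per-problem rate is a made-up figure for illustration, not something taken from the announcement.

```python
def prob_at_least_one_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds,
    when each attempt succeeds with probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-problem success rate; nobody outside knows the real one.
    per_problem_rate = 0.01
    for attempts in (1, 10, 100, 1000):
        chance = prob_at_least_one_success(per_problem_rate, attempts)
        print(f"{attempts:>5} attempts -> P(at least one success) = {chance:.3f}")
```

Under those assumptions, a single reported success out of 100 quiet attempts is closer to the expected outcome than to a surprise, which is why the number of attempts matters so much.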


Yeah, maybe it's just the Texas Sharpshooter Fallacy basically, but with AI.

And if it isn't, we should find out very soon. If AI has got so good as OpenAI's post implies, then we should soon see a veritable blooming in the production of mathematical results, by lay people no less. No mathematicians needed! OpenAI say that their secret LLM solved the planar unit distance problem "autonomously" and the companion remarks say it one-shotted it; and while the companion remarks make it clear that there was a lot of refinement and improvement work done by humans, everyone seems to agree that the AI did the job by itself.

If that's true, if we're really at that level of autonomous mathematical reasoning ability, then we should see hundreds, even thousands, of open problems suddenly solved in a matter of years if not months. We'll just have to wait and see.



Yes, as some of these are being solved by the same person, I think my point is even more relevant: you try 1000 problems and solve a few, and only report the few, and it just seems like a matter of time until the rest are solved. But if you report that it didn’t work on the others, your conclusion is different.

I think it is important to temper expectations in light of the fact that these announcements are coming from a startup company with shady values looking to imminently IPO, and thus represent the most biased and misleading take of the situation possible.


Is that 326 solved? As per my comment above?

>> If that's true, if we're really at that level of autonomous mathematical reasoning ability, then we should see hundreds, even thousands, of open problems suddenly solved in a matter of years if not months.

Stressing "hundreds, even thousands".


No, problem #326, you didn't give a few days timeline.

Google released a paper about solving 9 more Erdos problems for an average of $100 each:

https://arxiv.org/html/2605.22763v1

In a year I think we'll probably have seen hundreds of open problems solved, even if there is some low-hanging-fruit exhaustion bottleneck.


Then we should wait a year and see what happens.


It would be more impressive if it wasn’t behind the closed doors of a rich company. For all we know, they could’ve paid some mathematicians to work on the problem and pretend that their results are from ChatGPT.


Yet it still codes like a junior developer that memorized all of stack overflow.


PhDs code like that too. Especially if they're statisticians :)


Even if the code was like that (it isn't), the power of the current crop of models to analyze data for patterns and build context out of code is leaps and bounds what it was even a year ago. And any developer will tell you that the hardest part of fixing a bug is knowing where the bug is in the first place. Once you know where it is, fixing it is usually trivial.

There is serious magic happening in the construction of model context.


Personally I don't find this to be true anymore! It's not always great and still often tends towards unneeded complexity (especially if not pushed a bit), but I often find GPT 5.5 writing code I would have written myself. This was very much not true with earlier models (which would make something that worked, but which I'd always have to rewrite to make it "good code").


Personally I found 5.5 a massive step back from 5.4. Both of them still use way too many fallbacks and unnecessary checks, especially if you're having it output PHP. It's fine if you're just one person, checking everything and able to catch and correct. But it's really bad when you have a team all using it, not checking the output and trusting it, which leads to spaghetti code. It technically works, but it's very messy and will no doubt lead to buggy code.

It still writes like a junior dev, in that despite AI being able to get a picture of an entire repo, its changes are typically confined to the task it's working on, and it will opt to duplicate logic to keep changes contained. Again, it technically works, but it's not ideal.


Yeah, it has a tendency to default to "smallest local hack that will work" and code as defensively as possible.

BUT I have had great success using AGENTS.md and becoming better at prompting to get it to not be like this.

Basic approach in AGENTS.md: don't code defensively, yada yada, we have a validation layer at X, no need to check for anything behind that layer. Works well.
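
For illustration only, a minimal sketch of the "validate once at the layer, trust it behind the layer" style that kind of instruction is asking for. The names here (`Order`, `validate_order`, `order_total_cents`) are hypothetical and not from any real codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """An order that has already passed the validation layer."""
    quantity: int
    unit_price_cents: int


def validate_order(raw: dict) -> Order:
    """The validation layer: the only place that inspects untrusted input."""
    quantity = raw.get("quantity")
    unit_price_cents = raw.get("unit_price_cents")
    if not isinstance(quantity, int) or quantity <= 0:
        raise ValueError("quantity must be a positive integer")
    if not isinstance(unit_price_cents, int) or unit_price_cents < 0:
        raise ValueError("unit_price_cents must be a non-negative integer")
    return Order(quantity, unit_price_cents)


def order_total_cents(order: Order) -> int:
    """Behind the layer: no re-checking and no fallbacks, just the logic."""
    return order.quantity * order.unit_price_cents
```

The defensive version would repeat the `isinstance` checks and add default values inside `order_total_cents` as well, which is the pattern the AGENTS.md note is trying to switch off.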

An approach I've found helpful when prompting: What would be the best architecture for this change? If you say "do X" it'll tend to just do the hackiest, shortest path thing. If you say, "what's the best way to do X?" it will think more holistically.

That said, who knows, maybe when it's PHP it just really wants to hack ;-)

(Also, yes, you still need to review the code -- it will still do stupid things, so you can't just be pure hands off w/o ending up with quality degradations. The same is true of humans too though in my experience...)


Idk man, I think at this point, if you can't get good code out of frontier models, you're doing something wrong. Plenty of resources out there for you to familiarize yourself with the workflows if you can be bothered.


100%. this is what I tell people who fail at using these models. the model isn't the problem anymore.


What is the last model you used... lol. Linus Torvalds himself said the newest models are better than him at coding.


This doesn't sound correct. Source?


In recent months, Linus said it specifically about code for a personal side project of his. The quote was in the commit message. (I’m not the grandparent commenter, and I think grandparent commenter’s claims may be too broad or require context.)


This is missing the context: the vibe-coded application was written in Python, while the main code in this side project was written manually in C by Torvalds. He never said that AI produces better code than he does in the language in which he is proficient.


https://github.com/torvalds/AudioNoise

> The python visualizer tool has been basically written by vibe-coding. I know more about analog filters -- and that's not saying much -- than I do about python. It started out as my typical "google and do the monkey-see-monkey-do" kind of programming, but then I cut out the middle-man -- me -- and just used Google Antigravity to do the audio sample visualizer.


this is not true at all. I'm using Opus and it's great at very complex problems.


Not true anymore since like early 2025 and especially since last December.


Clearly you've never supervised junior developers.


That's literally my job...


>That's literally my job...

Since you’re not in a unique position, I can confidently state that your comparison of LLMs to jr developers seems unfounded. Today, LLMs produce code that is superior to junior developer code by an order of magnitude.

Notably, they demonstrate consistent syntax, clear separation of concerns, strong test coverage, organizational rigor, idiomatic API usage, and the ability to generate and maintain documentation, among other measurable qualities.

LLMs generally operate at a staff engineer level for a number of languages and ecosystems (including polyglot projects).


I'm not sure what your background is, but as a staff level engineer, I can assure you they do not. They in fact seem to lack any understanding of architectural intent within a sufficiently large code base. This seems obvious since they can't fit the entire code base in their context at once.

We have many folks (not engineers) at our company using LLMs to open PRs, and every one of these PRs has profound architectural design problems.


> They in fact seem to lack any understanding of architectural intent within a sufficiently large code base. This seems obvious since they can't fit the entire code base in their context at once.

This is a critique of scale, moving the goalpost.


Nonsense. The goalpost is "this is as good as a senior engineer". A senior engineer can easily understand architectural rationale. Don't dismiss my argument because it's inconvenient to yours.


LLMs absolutely do not exceed the abilities of junior devs. They don't even meet that bar, let alone exceed it. Junior devs are capable of getting syntax right without someone going "hey you messed that up". LLMs are not. Junior devs get basic logic right. LLMs do not.

Comparing an LLM to a senior developer is an absolute joke.


1. Which LLM are you using that is “not capable of getting syntax right”?

2. Are you referring to output without a compiler or LSP checking it? Even then, the recent LLMs I've used still frequently get syntax right, whereas I'd expect juniors are often using an LSP or compiler to catch mistakes while writing code.


What model are you using? Llama 3.1 8b? This has not been true for years.


Well, in another comment he said that LLMs haven't improved in 3 years. So this puts him at Llama 1 7B.


they 100% exceed even medior developers.

who cares about syntax? who cares about iteration? what I care about are _results_, which they can produce at the end. do you check your human colleagues how many iterations they do before committing/showing their work to anybody? no. why should you set such a bar then for your LLM?


Or PhDs


What's laughable is that an OpenAI employee invented the term "PHD level intelligence", and you think that "PHD level intelligence" is a real term that describes a real thing, and you are repeating it here.


It's clearly smarter than any PhD and dumber than any ant.


[flagged]


Thinking you're magically smarter than others is indeed an essential part of the NPC trend, to the extent that it in itself becomes an NPC thing to say.

It's pretty much a 1:1 match to the "we're all unique snowflakes" meme, with an army of Buzz Lightyear toys repeating the same in the background.


I still laugh.


Have you updated your priors after this announcement? If not, why not?


Yes let me calculate the exact change it’s 0.004748394 probability now based on my own made up statistical vibes that I feel


I don't have enough information about the announcement for it to mean much to me. I don't know much about this field of maths. I don't know how many mathematicians were actively working on this problem. It could be zero, which would indicate it's not really that interesting. The article gushes about how it's a Very Important Problem, but it's not even mentioned on https://en.wikipedia.org/wiki/List_of_conjectures_by_Paul_Er.... I'm sure the busy folk at OpenAI will fix that soon, however. Furthermore, the extensive dishonesty of companies like OpenAI makes me suspicious of just how this was achieved. Overall the announcement is of little interest to my "priors", although I don't typically think in such terms.


It is extremely well known. Lots of people have tried to solve it and it stood basically stuck for 80 years. It is getting harder every day to downplay these models.

Given its elementary nature (very easy to state), you can bet that a lot of very bright people have worked on it (I know of one MIT graduate who specialized in Geometry had a lot of interest in it).


I don't believe the result at all. I think it contains faulty logic. Perhaps the mathematicians involved can read the tea leaves and decide something interesting happened, but all this AI psychosis bullshit still refuses to accept that AIs do not, and cannot, have a mental model of the world.

Moreover, model output is incredibly good at looking credible but being wrong. It has NEVER produced something correct for me in a field of which I am an expert without some external oracle to validate claims (like e.g., Lean)


At this point the term "AI psychosis" is the more apt label for AI skeptics. Here we have literal Fields medalists vouching for correctness and relative importance of the result, but who cares, "I don't believe the result at all". Just pure denial of reality.


You should believe that the proof works at least as much as any other paper in mathematics. The proof has been scrutinized by experts and simplified and improved. If you don't believe that then I'm sorry but you are deluding yourself.


You don't have enough knowledge to dismiss them, but you still laugh? For?


Do you have enough knowledge? I laugh at everyone who accepts these claims in the light they're presented despite knowing so little.


The GP said "I like how everyone laughed when OpenAI said their models will have PhD-Level Intelligence", and you said you still laughed, so I just wanted to confirm if you did laugh at that. Apparently you did not. Thanks for the confirmation. I think you should not, given your admittedly limited understanding.


You don't know the names of the mathematicians who've given their thoughts on this? If not, you really should just not comment on anything mathematical ever again.


I do know their names. However I'm not in the field and there are many cases in recent years of high-profile scientists putting their weight behind highly dubious claims. Thanks for the advice, by the way.

Note that I'm not disputing the validity of the counterexample itself.


That's fair. If you're familiar with mathematics culture though, you'd know that "LLM hype" is not really in their blood and is certainly not something that gains you PR points. I think it's safe to take their comments at face value. I do think the ice is beginning to thaw though and perhaps in the next few years, there will begin to become more of a hype phase in math if some really high profile problems begin to fall to AI, although one might argue at that point that the hype would be deserved.


Doesn't make much sense, does it? If I accept that I don't have enough information on something, then I withhold judgement. There's nothing so reserved about mockery and cynicism. You're not cautious, you outright hedge that it's all a lie, and paint everyone else to be a complete idiot for thinking at all otherwise.

The world runs on trust, specifically trusting expert advice. It'd seem that due to resource constraints and scale, that's the best available option. By extension, there should be absolutely nothing weird or surprising on people following suit. It's why these companies themselves rely on expert counsel, and defer to their appraisals for marketing. The opposite is what's weird and unusual, and what requires more substantiation.

It's interesting that those who come out swinging against "trusting the experts", or really, trusting anyone else but them, not only ~never acknowledge this, but are seemingly outright proud of it, considering it as their own unique little trait, egocentrically revelling in it. It's almost as if epistemic rigor and truthfulness was not their actual concern.

Woohoo, I'm distrustful and cynical. Behold my unfathomable wisdom! Bonus points if they're also hurtful, because flipping the arrow on "hard truths -> hurt feelings" is a masterclass in reasoning too, of course.

I can appreciate faulting experts and organizations for misusing people's trust, and looking out for this angle, but given how unavoidable and fundamentally useful trusting itself is, blaming people for defaulting to trusting makes no sense to me whatsoever. It comes across as just the usual trope of blaming the individual. If you're from a lower-trust culture / environment, I can appreciate why you'd have a more distrustful default disposition (and why people might come across as suckers), but the principle still holds.


The problem was pretty well known, and had many human attempts. There's some room to argue that the right humans hadn't attempted it, as the solution used advanced methods from another field of math. But imho, whereas many prior AI victories could be explained by not enough human attention, there is no such excuse in this case, and one should acknowledge this is a notable achievement.


Prior whats?



When a qualifying noun is absent, "priors" means prior beliefs.
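
As a rough sketch of what "updating" a prior means, here is Bayes' rule for a single yes/no belief. The function name and the numbers in the example are made up for illustration; they are not estimates about this announcement.

```python
def update_prior(prior: float,
                 p_evidence_if_true: float,
                 p_evidence_if_false: float) -> float:
    """Return the posterior P(belief | evidence) via Bayes' rule.

    prior:               P(belief) before seeing the evidence
    p_evidence_if_true:  P(evidence | belief is true)
    p_evidence_if_false: P(evidence | belief is false)
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical inputs: a 20% prior, and evidence that is three times
    # as likely if the belief is true as if it is false.
    posterior = update_prior(0.2, 0.9, 0.3)
    print(f"posterior = {posterior:.3f}")
```

The prior is the belief going in, the posterior is the belief coming out, and the posterior becomes the new prior the next time evidence shows up.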


Prior to what? Why not just say beliefs?


"Update your priors" is a common expression in English: https://en.wiktionary.org/wiki/update_one%27s_priors#English


Your wiktionary link indicates it is not a common expression in English but instead something "rationalist community" people say.


HN is a rationalist community hangout.


we're reading comments on a post about math proofs


No it's not. Where do you come up with this? Just because you searched the phrase on Google and there's a single result for it on a wiki? Who do you know that's using this expression regularly?



"Common" is an exaggeration.


And the goalposts will keep getting moved all the way to the singularity. And then those people will/would say "Oops. I was wrong."


Large language models do not have pigeon-level intelligence. They can't even feed themselves.



