So I am extremely hyped about this, but it's not clear to me how much heavy lifting this sentence is doing:
> First, the problems were manually translated into formal mathematical language for our systems to understand.
The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?
The computer did find the answers itself. I.e., it found "even integers" for P1, "{1,1}" for P2, and "2" for P6. It then also provided a Lean proof in each case.
It would make a lot of sense for the Lean formalisations of the problems that the researchers fed to the AI to be provided. I'm not assuming bad intent in not providing them, but it would help to understand the results better.
The formal definition of the first theorem already contains the answer to the problem:
"{α : ℝ | ∃ k : ℤ, Even k ∧ α = k}" (which means the set of even real numbers). If they say that they translated the first problem into a formal definition, then it is very interesting how they initially formalized the problem without including the answer in it.
That wasn't very nice. Are you curious about anything? Happy to help. I'd proactively do it, but I don't want to guess at what's in your mind. My initial guess is you think I think that engaging with the public is an infinite loop. I don't!
I think maybe parent comment is referring to it essentially just employing a zerg rush but with the speed and reaction time of an AI?
Not 100% sure...
Unrelated, iirc the StarCraft functionality was an early example of generalizing a pretrained NN, AlphaGo, and showing that it could adapt to learn and defeat games across strategic domains, especially after it learned so much strategy from the most difficult, widely played, and most strategically varied physical game available.
Exactly, a problem and its answer are just different ways of describing the same object. Every step of a proof is a transformation/translation of the same object. It would be disingenuous to say that some heavy lifting isn't done in formalizing a problem but it seems that step is also performed by a machine:
"We established a bridge between these two complementary spheres by fine-tuning a Gemini model to automatically translate natural language problem statements into formal statements, creating a large library of formal problems of varying difficulty."
I'm confused, is the formalization by Gemini or "manually"? Which is it?
"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."
Huh, so MCTS to find the ‘best’ token using a (relatively) small, quick language model? Sounds like an interesting approach to small model text generation too…
Yeah, I am not clear on the degree to which this system and LLMs are related. Are they related? Or is AlphaProof a complete tangent to ChatGPT and its ilk?
It's a math Language Model. Not even sure it's a Large Language Model. (Maybe shares a foundational model with an English LLM; I don't know)
It learns mathematical statements, and generates new mathematical statements, then uses search techniques to continue. Similar to AlphaGo's neural network, what makes it new and interesting is how the NN/LLM part makes smart guesses that drastically prune the search tree, before the brute-force search part.
(This is also what humans do to solve math problems. But humans are really, really slow at brute-force search, so we rely almost entirely on the NN pattern-matching, analogy-making part.)
These kinds of LLMs are also very interesting for software engineering. It's just a matter of replacing Lean with something that is more oriented towards proving software properties.
For example, write a formal specification of a function in Dafny or Liquid Haskell and get the LLM to produce code that is formally guaranteed to be correct. Logic-based + probability-based ML.
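As a toy illustration of the idea (in Lean rather than Dafny or Liquid Haskell, and purely my own sketch, nothing from AlphaProof): state the specification as theorems, have the model propose the implementation and the proofs, and let the checker do the trusting for you.

    import Mathlib

    -- Hypothetical example: an implementation plus its specification,
    -- both checked mechanically by Lean's kernel.
    def myMax (a b : ℕ) : ℕ := if a ≤ b then b else a

    theorem myMax_ge_left (a b : ℕ) : a ≤ myMax a b := by
      unfold myMax; split <;> omega

    theorem myMax_ge_right (a b : ℕ) : b ≤ myMax a b := by
      unfold myMax; split <;> omega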
My reading of it is that it uses the same architecture as one of the Gemini models but does not share any weights with it (i.e. it's not just a fine-tune).
This is really interesting. I would have expected the understanding to be that humans make a guess, test it, and learn from what did or did not work. The lessons learned from the prior tests would impact future guesses.
Do you know if a system like the OP is learning from failed tests to guide future tests, or is it truly a brute-force search, as if it were trying to mine bitcoin?
This quote from the article sounds like it learns from failed tests:
>We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
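Roughly, I read that loop as something like this (my paraphrase as pseudocode; the helper names are made up, this isn't DeepMind's code):

    # Rough sketch of my reading of the quote above -- not DeepMind's actual code.
    # generate_variation, attempt_proof, lean_check and update_model are
    # hypothetical stand-ins.
    def reinforce_on_variations(target, generate_variation, attempt_proof,
                                lean_check, update_model, max_rounds=1000):
        for _ in range(max_rounds):
            # Try the real contest problem with the current model.
            proof = attempt_proof(target)
            if lean_check(target, proof):
                return proof
            # Otherwise train on self-generated, usually easier variations.
            variant = generate_variation(target)
            candidate = attempt_proof(variant)
            if lean_check(variant, candidate):
                # Only kernel-verified proofs are used as training signal.
                update_model(variant, candidate)
        return None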
Reading between the lines a bit, that does answer the question I had, though I don't think I clarified it very well.
I read that to say the model's token weights are adjusted as it goes, so in an LLM sense it is kind of learning. It isn't reasoning through an answer in the way a human does though. Meaning, the model is still just statistically predicting what an answer may be and checking if it worked.
I wouldn't chalk that up to learning at all. An AI solving complex math doesn't even seem too impressive to me with the predictive-loop approach. Computers are well adept at math; throwing enough compute hardware at it to brute force an answer isn't surprising. I'd be really impressed if it could reliably get there with a similar number of failed attempts as a human; that could indicate that it really learned and reasoned rather than rammed through a mountain of failed guesses.
>with a similar number of failed attempts as a human
It'd be hard to know how many failed attempts the human made. Humans are constantly thinking of ideas and eliminating them quickly. Possibly too fast to count.
I've never competed in math competitions at this level, but I would have expected it to be pretty clear to the human when they tested a different solution. As complex as the proofs are, is it really feasible that they are testing out a full proof in their head without realizing it?
Hmm, I think it comes down to what the definitions of "testing" and "attempt" are. A human will generate many ideas, and eliminate them without creating full proofs, by just seeing that the idea is going in the wrong direction.
It sounds like AlphaProof will doggedly create full proofs for each idea.
yeah but it doesn't understand the exact syntax on an absolute level, does it...? I understood this to be the same as any language model applied to programming languages (aka Formal Languages). Is that mistaken?
As far as I understand, and I may be wrong here, the system is composed of two networks: Gemini and AlphaZero. Gemini, being an ordinary LLM with some fine-tunes, only does translation from natural to formal language. Then, AlphaZero solves the problem. AlphaZero, unburdened with natural language and only dealing with "playing a game in the proof space" (where the "moves" are commands to the Lean theorem prover), does not hallucinate in the same way an LLM does because it is nothing like an LLM.
Yes, but the problem space means that invalid outputs can be quickly identified - whereas general programming isn’t necessarily amenable to rapid checks.
I mean, aren’t you just describing formal language syntax? Seems like a fundamentally similar situation - the computer can automatically flag any syntax errors in a millisecond by checking it against the generating grammar for that language. That’s what makes a formal language in the first place, I think!
I do think this language is considerably more robust than the typical programming language, which means a sound program is more likely to end up being valid/“correct”. But still, that’s a difference of degree, not kind, IMO
I don’t mean syntax errors - I mean the difficulty of validating code that contains side effects (like http requests, database access etc).
Validating a math proof either terminates in a reasonable time (in which case it’s useful for training), or does not (in which case the AI should be discouraged from using that approach).
To speak generally, that translation part is much easier than the proof part. The problem with automated translation is that the translation result might be incorrect. This happens a lot even when people try formal methods by hand, so I guess the researchers concluded that they'd have to audit every single translation regardless of using an LLM or whatever tools.
> However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
So they didn't actually solve autoformalization, which is why they still needed humans to translate the input IMO 2024 problems.
The reason why formalization is harder than you think is that there is no way to know if you got it right.
You can use Reinforcement Learning with proofs and have a clear signal from the proof checker.
We don't have a way to verify formalizations the same way.
> However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Have a human to manually solve them?
(In contrast to problems actually from "a huge database of IMO-type problems", they do have answers for these already).
> A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Have a human to manually solve them?
Formal proofs can be mechanically checked for correctness. We just don't know what the answer is. Think of it as an extremely rigorous type system that typically requires really long type annotations, where the annotation itself is a complex program. So if AlphaProof happens to generate a proof that passes this checker, we know that it's correct.
One more trick: They look for both proofs and disproofs. So even if they failed the formalization and created a "wrong" theorem, it's just another task.
You write proofs in a formal language that can be machine checked. If the checker is happy, the proof is correct (unless there is a bug in the checker, but that is unlikely).
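Toy example of what "the checker is happy" means (nothing IMO-level, just to make it concrete):

    import Mathlib

    -- Accepted: the kernel certifies that this proof really proves the statement.
    example (n : ℕ) : n + 0 = n := by simp

    -- Rejected: trying to "prove" something false simply fails to check.
    -- The following, if uncommented, is refused with an unsolved-goals error:
    -- example (n : ℕ) : n + 1 = n := by simp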
They said the incorrectly formalized ones are usually easier, so I assume they just hire humans to solve them in the old way until the AI is smart enough to solve these easier problems.
An incorrectly formalized problem is a different problem, and a solution to any formalized problem is still useful for AI training because such solutions can be mechanically checked for correctness, and this does not require hiring humans. What requires humans is the initial formalization process, since that is more of a language translation task, which requires nuance and judgment and is difficult to mechanically verify.
> We don't have a way to verify formalizations the same way.
While there is no perfect method, it is possible to use the agent to determine if the statement is false, has contradictory hypotheses, or a suspiciously short proof.
> To speak generally, that translation part is much easier than the proof part.
To you or me, sure. But I think the proof that it isn't for this AI system is that they didn't do it. Asking a modern LLM to "translate" something is a pretty solved problem, after all. That argues strongly that what was happening here is not a "translation" but something else, like a semantic distillation.
If you ask an AI (or person) to prove the halting problem, they can't. If you "translate" the question into a specific example that does halt, they can run it and find out.
> While the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.
However, it's unclear what initial format was given to the agents that allowed this step
FWIW, GPT-4o transcribed a screenshot of problem 1 perfectly into LaTeX, so I don't think "munge the problem into machine-readable form" is per se a difficult part of it these days even if they did somehow take shortcuts (which it sounds like they didn't).
… no? After the LaTeX output, I told stock GPT4o that the answer was "all even integers", and asked for the statement in Lean. I had to make two changes to its output (both of which were compile-time errors, not misformalisations), and it gave me the formalisation of the difficult direction of the problem.
Both changes were trivial: it had one incorrect (but unnecessary) import, and it used the syntax from Lean 3 instead of Lean 4 in one lambda definition. A system that was trained harder on Lean would not make those mistakes.
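For anyone curious, the Lean 3 vs Lean 4 slip is the classic one, something like:

    -- Lean 3 style lambda (the kind of slip described): λ x, x + 1
    -- Lean 4 / current Mathlib style:
    def succFun : Nat → Nat := fun x => x + 1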
The one actual error it made was in not proposing that the other direction of the "if and only if" is required. Again, I am quite confident that this formalisation failure mode is not hard to solve in a system that is, like, actually trained to do this.
Obviously formalising problems that a working mathematician solves is dramatically harder than formalising IMO problems, and is presumably way ahead of the state of the art.
This is really not the kind of problem LLMs are bad at! But since you insist, given the LaTeX, Claude 3.5 Sonnet correctly stated the theorem in full while inventing notation for the floor operation (it did correctly note unprompted what the right function was and how to obtain it from mathlib, but it incorrectly attempted to define syntax sugar for it).
The hard part isn't getting the formalisation right sometimes, it's getting it right reliably (and unlike mistakes in the formal part, there's no way for the system to check itself in that part).
IDK, even for translation between languages outside of mathematics, missing small qualifiers that change the whole meaning of the sentence is a worrying failure mode I've seen, and with mathematical problems there are a lot more cases like that.
I think that's exaggerating a bit. If you are familiar with both Lean and LaTeX, then I think transcribing these problems to Lean only takes about twice as long as transcribing them to LaTeX.
Think of problems in NP - you can check the answer efficiently, but finding the answer to check is the hard part... This is basically what we're looking at here: The proof checker can quickly evaluate correctness, but we need something to produce the proof, and that's the hard part.
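Toy illustration of that asymmetry (nothing to do with AlphaProof's internals, just the generic NP pattern): checking a claimed subset-sum certificate is trivial, finding one is the expensive part - same shape as Lean checking a proof vs. something having to produce it.

    from collections import Counter
    from itertools import combinations

    def check_certificate(nums, target, subset):
        # Verification is cheap: a membership check plus one sum.
        return not (Counter(subset) - Counter(nums)) and sum(subset) == target

    def find_certificate(nums, target):
        # Finding a certificate is the hard part (worst-case exponential).
        for r in range(len(nums) + 1):
            for subset in combinations(nums, r):
                if sum(subset) == target:
                    return list(subset)
        return None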
The AI is the "solver network", which does the (directed) search over solutions in Lean. The AI's contribution is doing an efficient search, I suppose.
I'm also waiting for my answer on the role of the Gemini formalizer, but reading between the lines, it looks like it was only used during training the "solver network", but not used in solving the IMO problems. If so then the hyping is greatly premature, as the hybrid formalizer/solver is the whole novelty of this, but it's not good enough to use end-to-end?
You cannot say AlphaProof learned enough to solve problems if formalization made them easier to solve in the first place! You can say that the "solver network" part learned enough to solve formalized problems better than prior training methods.
> When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean.
To me, this sounds like AlphaProof receives a "problem", whatever that is (how do you formalize "determine all X such that..."? One is asked to show that an abstract set is actually some easily understandable set...). Then it generates candidate theorems, presumably in Lean. I.e. that the set is {n : P(n)} for some formula or something. Then it searches for proofs.
I think if Alphaproof did not find {foo} but it was given then it would be very outrageous to claim that it solved the problem.
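So presumably the outer loop has roughly this shape (entirely my guess, with hypothetical helper names - not anything DeepMind has published):

    # My guess at the shape of the "generate solution candidates, then prove or
    # disprove" loop. propose_answer, state_theorem, search_proof and
    # search_disproof are hypothetical stand-ins.
    def solve_determine_all(problem_statement, propose_answer, state_theorem,
                            search_proof, search_disproof):
        tried = set()
        while True:
            answer = propose_answer(problem_statement, avoid=tried)  # e.g. "even integers"
            tried.add(answer)
            theorem = state_theorem(problem_statement, answer)  # "{x | P x} = answer"
            proof = search_proof(theorem)          # Lean proof search
            if proof is not None:
                return answer, proof               # kernel-checked, so we are done
            if search_disproof(theorem) is not None:
                continue                           # wrong candidate, try another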
As someone with a maths degree who hasn't done this kind of thing for half a decade, I was able to immediately guess the solution to (1), but actually proving it is much harder.
as a noob, i feel that formalizing is a major part of solving the problem by yourself. my assessment is that once you identify certain patterns, you can solve problems by memorizing some patterns. but people like me can struggle with the first stage and solve the wrong problem.
still good progress nonetheless. won't call the system sufficient by itself tho.
please pardon my ignorance, but to me a tower of hanoi question as a middle-schooler was the hardest thing to comprehend. but after learning about it, it is no longer quite as challenging to tackle.
i understand that there are very hard questions for the olympiad. but it might be possible to learn about some recurring types of them by looking at past instances. it may not be the meta for IMO but has been for other kinds of exams.
As is often the case, creating a well formed problem statement often takes as much knowledge (if not work) as finding the solution.
But seriously, if you can't ask the LLM to solve the right question, you can't really expect it to give you the right answer unless you're really lucky. "I'm sorry, but I think you meant to ask a different question. You might want to check the homework set again to be sure, but here's what I think you really want."
Presenting this just as "translating into formal language" omits important information.
Lean isn't just a formal language, it is also a theorem prover. Could the IMO participants use the nlinarith tactic? Could they use other tactics?
Of course not, they had to show their work!
Could they have mathematicians translate the problem statements into the formal language for them?
Of course not, they had to do it themselves. In "How to solve it" Polya stresses multiple times that formalizing the initial question is an important part of the process.
Then, the actual computational resources expressed in time are meaningless if one has a massive compute cloud.
I'm a bit dissatisfied with the presentation, same as with the AlphaZero comparison to an obsolete Stockfish version that has been debunked multiple times.
This is certainly impressive, but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.
That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.
But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.
It really sounds like AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with apples-to-oranges comparison?
Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?
I have met IMO competitors. They are insanely smart. I wouldn't imagine it's possible to be that smart before I started hanging in these circles. So more like 25% of 0.01% of high school students.
Time is not a very interesting dimension here, because humans don't use the same CPU as huge GPU clusters. The binary "is it able to reach a solution given enough resources?" is more interesting (for GPT/Claude the answer is a clear negative).
But many more of those students (including many not in the contest) could solve those problems given more time than the 4.5hr sessions.
If these problems were important to solve, redundantly by thousands or millions of people (like the real work that most people do), far more people would put in the effort to learn how to solve these problems.
It's just a weird comparison. Contests are very artificially structured in ways that don't make sense for comparing to computers.
I think what we need to look at is that AI systems were not able to do it before and now they are able to do it. Soon these systems with millions of dollars of compute will just scale the algorithms and beat humans at this as well - just like they did for chess, go, shogi etc.
You probably misinterpreted the above: 25% of the participants in IMO, who are generally high school students. Never claimed that 25% of all high schoolers would do better.
Now, you may say how time is not a useful dimension here, but really, this is where we are seeing a lot of these advances come from: general researchers today do get access to huge compute capability, allowing them to quickly iterate on different ideas. In a sense, they can be less smart about their use of resources and simply try things out: this does drive the innovation (compared to waiting for their turn on a supercomputer nearby).
And finally, time is essential even for humans: given a couple hundred years, they will find the proof for Fermat's last theorem, but they might not do it in 4.5h. Since we are comparing AI capabilities to humans in the article, it's very possible that increased compute will never allow AI to find novel proofs we have not come up with either. That's where the AI bit comes in: we know that brute searching through the entire possible space of proofs is still too expensive for our compute capabilities, so we need AI to emulate our "intuition" when looking for the direction to narrow down the search.
So there are really two reasons time matters: 1. getting enough compute might still be far away (heck, prime factorization and elliptic curves are still the basis of most of the world's cryptography for that reason) and 2. maybe it's not even enough to increase compute capability to make huge jumps in problem solving capabilities (iow, maybe we are reaching the maximum of where the approach can take us).
I've seen chess players, they are smart too. So what? A specialized model on commodity hardware beats most of them, if not all. This doesn't mean the model is smarter. The same could be true here. It's possible that the solutions are one step away from the model's database, or training set. Humans don't remember this much and have to do many more actions in their heads.
In other words, a model which can solve any/most school / college exam problems isn't necessarily smart. It can be just a database, or, in fact, a lookup table. A smarter version can be a lookup + a one-step test. Not saying it's bad, but it doesn't scale to less formalized domains. BTW, in this case the formalization was done by humans.
> medals are awarded to 50% of the participants (high school students)
In case this confuses anyone: the high school students in question are not a standard sample of high school students. AFAIK, they are teams of the ~6 strongest competitive problem solving high school students from each country.
In my opinion (not Google's) the only reason they didn't get gold this year (apart from being unlucky on problem selection) is that they didn't want to try for any partial credit on P3 and P5. They are so close to the cutoff, and usually contestants with a little bit of progress can get 1 point. But I guess they didn't want to get a gold on a technicality--it would be bad press. So they settled for an indisputable silver.
The AI took a day on one of the problems, so it must have generated and discarded a lot of proofs that didn't work. How could it choose which one to submit as the answer, except by the objective fact of the proof passing in Lean?
When tackling IMO problems, the hard part is coming up with a good approach to the proof. Verifying your proof (and rejecting your false attempts) is much easier. You'll know which one to submit.
You are a human, not an AI. You know whether your idea seems related to the solution. The AI has thousands of ideas and doesn't know which are better. Graders shouldn't accept a thousand guesses grubbing for 1 point.
If verifying a good idea is easy, then the evidence shows that the AI didn't have good ideas for the other 2 problems.
we are talking about lean proofs. Given a formal statement and a proof, Lean can verify whether it's correct or not. It's like generating computer programs to solve a problem - the difficulty lies in generating useful solutions/sub-solutions so that the search is effective. They achieve this by using Gemini as a Lean proof generator, i.e. a world-model LLM fine-tuned to generate Lean proofs more effectively.
Humans are even better at this, as you mention - but effectively the approach is similar. Come up with a lot of ideas and see what proves it.
Well, he does have twice the amount of silver medals... And can speak the English language... Although, an AI attempting to speak with the human race entirely through an esoteric math-proofing language would be an interesting take on the whole, "humans make ET contact, interact through universal language.... pause for half of the movie, military hothead wanting to blow it out of the sky, until pretty lady mathematician runs into the oval office waving a sheet of paper... OF MATH!" trope....
But now, it's a race between you and I, to see who can write the screenplay first!
In maths, you are only solving a problem by building a formal solution that is the proof.
E.g. if it asked you to do something computationally hard (when done in a brute force way: calculators not allowed), and you put out the answer without showing the process using formal math knowledge, you'd get zero points. Even if it was easy to prove the answer was correct.
That's why IMO and many math exams won't take a single answer even if correct.
Yeah, the progress has to be quite significant, no points are awarded for trivial observations. Thus scores are usually bimodal around 0 and 7. In the linked stats you can see that 1 point for P3/P5 was less common than full score on other problems.
Problems 2 and 5 have a lot of 1s. Sometimes there is an interesting advance that is easier than the full solution. Also, in problem 6 the only hope for most was to get that 1 point.
I don't believe anything was graded by the IMO, Google is just giving itself 7 for anything proved in Lean (which is reasonable IMO), so they can't really try for partial credit so much as choose not to report a higher self-graded number.
From the article: "Our solutions were scored according to the IMO’s point-awarding rules by prominent mathematicians Prof Sir Timothy Gowers, an IMO gold medalist and Fields Medal winner, and Dr Joseph Myers, a two-time IMO gold medalist and Chair of the IMO 2024 Problem Selection Committee."
Fair enough. Although I think the question is whether P3/P5 were given zero points vs not evaluated vs evaluated but not published. I don't think it is surprising that Lean-verified proofs get a 7.
It is an interesting question: how this new system quantifies progress and whether it can detect that a chain of reasoning is close to the solution. Although it was just 2 points shy of a gold medal (1 point of partial credit per problem not solved?), I'd suspect that the team would want a clean victory, given the magnitude of the claim. I bet next year's results will be 42/42. DeepMind is unstoppable!
One key difference between giving humans more time and giving computer programs more time is that historically we have had more success making the latter run faster than the former.
> What's the need to taint the impressive result with apples-to-oranges comparison?
Most of DeepMind’s research is a cost-centre for the company. These press releases help justify the continued investment both to investors and to the wider public.
Just bigger chip area at lower clock could do it too if there is no threshold on cost anyway and they were limited by power. Google would hit enough accumulated production of AI chip area and cluster build out even if Moore's law stopped. There will likely be big algorithmic improvments too.
The complexity of truth doesn't fit in a headline, let alone attract enough clicks to win the competition for eyeballs and virality - which means that even if someone somehow succeeded in telling the whole truth in a headline, that article would lose in the race for eyeballs to the clickbait-headline version, which means your eyeballs would (statistically) never see it, and only the clickbait version.
I'd say "welcome to the web" but this was true in 1800s newspapers as well.
This is the real deal. AlphaGeometry solved a very limited set of problems with a lot of brute force search. This is a much broader method that I believe will have a great impact on the way we do mathematics. They are really implementing a self-feeding pipeline from natural language mathematics to formalized mathematics where they can train both formalization and proving. In principle this pipeline can also learn basic theory building, like creating auxiliary definitions and lemmas. I really think this is the holy grail of proof assistance and will allow us to formalize most mathematics that we create very naturally. Humans will work post-rigorously and let the machine assist with filling in the details.
Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.
By contrast, this kind of open ended formalization is something where progress used to be extremely slow and incremental. I worked in an adjacent field 5 years ago and I cannot stress enough that these results are simply out of reach for traditional automated reasoning techniques.
Real automatic theorem proving is also useful for a lot more than pure mathematics. For example, it's simple to write out an axiomatic semantics for a small programming language in lean and pose a question of the form "show that there exists a program which satisfies this specification".
If this approach scales it'll be far more important than any other ML application that has come out in the last few years.
> Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.
The blog post indicates the opposite. The geometry problem in the IMO problem set was solved by AlphaGeometry 2, which is an LLM based on Google's Gemini. LLMs are considered relatively general systems. But the other three solved problems were proved by AlphaProof, which is a narrow RL system that is literally based on AlphaZero, the Go and Chess AI. Only its initial (bootstrapping) human training data (proofs) were formalized and augmented by an LLM (Gemini).
Only slightly more general. It only works for games that are zero-sum, deterministic, have no hidden information, and discrete game state and moves. Other examples include connect-4.
So finding Lean proofs can be conceptualized as a zero-sum game?
Another basic requirement is that valid moves / inference steps and the winning condition can be efficiently verified using some non-AI algorithm. Otherwise there would not be a reward signal for the reinforcement learning algorithm. This is different from answering most natural language questions, where the answer can't be checked trivially.
Theorem proving can be formulated as a game, see e.g., https://plato.stanford.edu/entries/logic-games/ and interactive theorem provers can verify that a proof is correct (and related sub problems, such as that a lemma application is valid).
Conceptually, if you're trying to show a conjunction, then it's the other player's turn and they ask you for a proof of a particular case. If you're trying to show a disjunction then it's your turn and you're picking a case. "Forall" is a potentially infinite conjunction, "exists" is a potentially infinite disjunction.
In classical logic this collapses somewhat, but the point is that this is still a search problem of the same kind. If you want to feel this for yourself, try out some proofs in lean or coq. :)
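A tiny concrete version of that game reading, as toy Lean examples of my own:

    -- Disjunction: it's "my move", I pick the branch I can win.
    example : 2 + 2 = 5 ∨ 1 + 1 = 2 := Or.inr rfl

    -- Conjunction: the "opponent" may demand either part, so I must supply both.
    example : 1 + 1 = 2 ∧ 2 + 2 = 4 := ⟨rfl, rfl⟩

    -- Exists: a potentially infinite disjunction; I win by exhibiting a witness.
    example : ∃ n : Nat, n * n = 9 := ⟨3, rfl⟩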
I don't think AlphaZero is related to this work, apart from both being NN-based. AlphaZero and its training pipeline fundamentally only works for "chess-like" two-player games, where the agent can play against itself and slowly improve through MCTS.
"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."
Don't dismiss search, it might be brute force but it goes beyond human level in Go and silver at IMO. Search is also what powers evolution which created us, also by a lot of brute forcing, and is at the core of scientific method (re)search.
What makes solving IMO problems hard is usually the limits of human memory, pattern-matching, and search, not creativity. After all, these are problems that are already solved, and it is expected that many people can solve the problems in about 1 hour's time.
That makes it, in principle, similar to or even easier than a championship-level chess move, which often takes more than 1 hour for a professional human (with more training than an IMO high school student) to find.
Another interesting concern is that when posing a problem to humans, it's fine to pose an "easy" brute-forceable problem, but humans, being slow brute-searchers, need to find more clever solutions. But if you give such a problem to a computer, it can trivialize it. So to test a computer, you need to pose non- easily-brute-forceable problems, which are harder for the computer than the others, but equally difficult for the humans as the other problems are.
Why do you say these are problems that are already solved? Sure, they're often variations on existing themes, but the same is true for chess positions and, honestly, almost everything else in any field of human endeavor.
Agreed that the absolute upper tier of chess players have trained longer and harder than most or all IMO contestants. Though I do wonder which (top-tier chess or the IMO) draws on a larger talent pool. To my understanding, a significant fraction of all high school students on Earth take some form of qualifying exam which can channel them into an IMO training program.
And as far as the being amenable to brute force (relative difficulty for humans vs. computers): it seems that chess was comparatively easier for computers, IMO problems are comparatively easier for humans, and the game of Go is somewhere in between.
These problems are literally already solved? Of course, the IMO problem designers make sure the problems have solutions before they use them. That's very different than math research, where it's not known in advance what the answer is, or even that there is a good answer.
I'm saying they weren't solved until the problem composer (created and) solved them. They're not, in general, problems for which solutions have been lying around. So "these are problems that are already solved" isn't introducing anything interesting or useful into the discussion. The post I was replying to was trying to draw a contrast with chess moves, presumably on the grounds that (after the opening) each position in a chess game is novel, but IMO problems are equally novel.
It's true that IMO problems are vetted as being solvable, but that still doesn't really shed any information on how the difficulty of an IMO problem compares to the difficulty of chess play.
The main difficulty in scaling AlphaProof is finding theorems to train it with. AlphaGo didn't have that problem because it could generate its own data.
And I understand the upper time limit for each question was 4.5 hours. So it solved one question almost immediately, two well over the allotted time (60 hrs), and two not at all. No medal for you, Grasshopper.
Contestants get 4.5 hours for each of the two days of competition. They have to solve three problems in that time, so on average you can spend 1.5 hours per problem (if you're aiming to finish all three).
That said, the gap from "can't do it at all" to "can do it in 60 hours" is probably quite a bit larger than the gap from 60 hours to 1.5 hours.
Timing something that can be ran faster by throwing better hardware at it honestly feels conceptually irrelevant, as long as the complexity is actually tractable.
Search is great, search works, but there was not a tonne to learn from the AlphaGeometry paper unless you were specifically interested in solving geometry problems.
I would argue that no actually searchable solution space is really infinite (if only because infinite turing machines can't exist). Finite solution spaces can get more than large enough to be intractable.
What about ℕ? Seems pretty infinite to me, unless with "actually" you mean finite in time and space, which would make your argument a tautology. Or am I missing something?
Almost every "number" "in" N doesn't actually exist. In the search for numbers that exist, we will most likely only ever find a finite set of numbers before the Universe or humanity dies.
Searches happen in finite time and space and, more importantly, systems performing those searches have practical finite limits on parameters that determine the size of the space within which that search can take place (such as available memory).
Even within fairly modest finite limits, you can produce a solution space that cannot be significantly searched with the available finite matter and time available in the observable universe.
Thus, the problem with using search isn't that solution spaces can be infinite, but that finite solution spaces can be unimaginably large.
For some problems validation is expensive. Like the particle collider or space telescope, or testing the COVID vaccine. It's actually validation that is the bottleneck in search not ideation.
You mean that by improving search we can solve any problem? What if the solution field is infinite? Even if we make the search algo 10x100 more performant, the solution field will still be infinite, no?
I imagine a system like this to be vastly more useful outside the realm of mathematics research.
You don't need to be able to prove very hard problems to do useful work. Proving just simple things is often enough. If I ask a language model to complete a task, organize some entries in a certain way, or schedule this or that, write a code that accomplishes X, the result is typically not trustworthy directly. But if the system is able to translate parts of the problem to logic and find a solution, that might make the system much more reliable.
My intuition is that a regular LLM is better at coming up with a correct task description from a fuzzy description than it is at actually solving tasks.
As resident strident AI skeptic, yeah, this is real.
But MCTS was always promising when married to large NNs and DeepMind/Brain were always in front on it.
I don’t know who fucked up on Gemini and it’s concerning for Alphabet shareholders that no one’s head is on a spike. In this context “too big to fail” is probably Pichai.
But only very foolish people think that Google is lying down on this. It’s Dean and Hassabis. People should have some respect.
And while AlphaProof is clearly extremely impressive, it does give the computer an advantage that a human doesn't have in the IMO: nobody's going to be constructing Gröbner bases in their head, but `polyrith` is just eight characters away. I saw AlphaProof used `nlinarith`.
Don't think of lean as a tool that the ai is using (ie, cheating), think of lean plus AlphaProof as a single system. There's no reason to draw artificial boundaries around where the AI is and where the tools that the AI is using are. Lean itself is a traditional symbolic artificial intelligence system.
People always want to knock generative AIs for not being able to reason, and we've had automated systems that reason perfectly well for decades, but for some reason that doesn't count as AI to people.
> I want my AI to use all the advantages it has to reinvent the landscape of mathematics
The interesting thing about math, or science, and art in general, comparing it to games like chess or go is that science gives you the freedom to continue to excel as a human while in games we have lost the game and/or the league.
Science and art are infinite and no AI can produce infinity.
Sure but note that's not "your AI". It's a closed-source, proprietary system by DeepMind who typically publish a result to reap the hype and then bury the system forever (see AlphaGo).
This does sound like a lot of fun. Since this AI only reaches silver level, presumably such a contest could be quite competitive, with it not yet being clear cut whether a human or a tool-assisted human would come out on top, for any particular problem.
nlinarith is a proof automation that attempts to finish a proof using the simplex method to find a linear combination of hypotheses and things that have already been proven, as well as some quadratic terms of them.
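For a flavour of what that buys you, here's a standard kind of Mathlib one-liner (my own toy example, nothing IMO-specific): the tactic finds the right combination of the hint (x - y)^2 ≥ 0 with the goal, a trick a contestant has to spot themselves.

    import Mathlib

    -- AM-GM-style inequality closed in one line by nlinarith given a square hint.
    example (x y : ℝ) : x * y ≤ (x ^ 2 + y ^ 2) / 2 := by
      nlinarith [sq_nonneg (x - y)]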
That's amazing. I was just about to comment that hooking this up to Lean [1] would be killer. This must be the way forward for higher math, as proofs are getting so complicated that almost no one understands all pieces of major proofs.
I think you might be thinking of the recent project to start Fermat's Last Theorem? The Riemann hypothesis has been easy to state (given what's in Mathlib) for years.
A good brief overview here from Tim Gowers (a Fields Medallist, who participated in the effort), explaining and contextualizing some of the main caveats: https://x.com/wtgowers/status/1816509803407040909
> ... but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.
yes, it is true, but getting onto the country-specific team is itself an arduous journey, and involves brutal winnowing every step of the way, e.g. regional math olympiad, then national math olympiad, etc.
this is then followed by further trainings specifically meant for this elite bunch, and maybe further eliminations etc.
suffice it to say, that qualifying to be in a country specific team is imho a big deal. getting a gold/silver from amongst them is just plain awesome !
Some countries pull these kids out of school for an entire year to focus on training for it, while guaranteeing them entry into their nation's top university.
I'm seriously jealous of the people getting paid to work on this. Sounds great fun and must be incredibly satisfying to move the state of the art forward like that.
I don't know about that. A lot of the work that should have been very satisfying turned out to be boring as hell, if not toxic, while at the same time, some apparently mundane stuff turned out to be really exciting.
I found the work environment to be more important than the subject when it comes to work satisfaction. If you are working on a world changing subject with a team of assholes, you are going to have a bad time, some people really have a skill for sucking the fun out of everything, and office politics are everywhere, especially on world changing subjects.
On the other hand, you can have a most boring subject, say pushing customer data to a database, and have the time of your life: friendly team, well designed architecture, time for experimentation and sharing of knowledge, etc... I have come to appreciate the beauty of a simple thing that just works. It is so rare, maybe even more rare than scientific breakthroughs.
Now, you can also have an awesome work environment and an awesome subject, it is like hitting the jackpot... and a good reason to be envious.
Awesome work environment for one person can be not ideal for another.
Pretty much all the top AI labs are both intensely competitive and collaborative. They consist of many former IMO and IOI medalists. They don't believe in remote work, either. Even if you work at Google DeepMind, you really need to be in London for this project.
The open-source software projects these companies critically depend on are developed by collaborators who have never met in person, and yet these companies still believe you can only do great work in the office.
I work in this space (pretraining LLMs). It looks fancier than it really is. It does involve wrangling huge ymls and writing regular expressions at scale (ok I am oversimplifying a bit). I should be excited (and grateful) that I get to work on these things but shoddy tooling takes the joy out of work.
Huh. So I tried to look it up just now and I'm not sure if I understand the difference. (To the extent that there is one - apparently one can mean the other, but I imagine they're usually used as follows.)
It looks like "jealous" is more being afraid of losing something you have (most commonly e.g. a spouse's affection) to someone else, whereas "envious" is wanting what someone else has?
No worries. I learned this too a while ago (was also using jealous instead of envious and vice versa myself). From my understanding the use of jealous is when you have something but that is threatened by some external factor, eg a partner, a friend having more fun with somebody else. Envious is when you covet something that you do not have currently but wish to, which is in this case playing with exciting tech.
I'm sure humans will always enjoy chess and art regardless of how much better AI is at it. In the same way, there will probably always be mathematics hobbyists who study math for fun. But I seriously doubt that in the near future there will be mathematicians publishing new advancements that aren't mostly or entirely discovered by AI. A human might get credit for a proof for asking the initial question, but there is pretty much no world where a computer can easily solve a meaningful mathematical problem but we insist on a human solving it more slowly and expensively instead.
Funny how people don't understand basic logic. If it is a proof in a logic, and the machine checked that proof, it is a proof, no matter that no human actually understands it.
A human doesn't need to understand the proof, they just have to understand why the proof is a proof.
The useful thing about proofs is that they are written in English (or another language), not formal logic. In general they can be mapped to formal logic, though. This means that people can digest them on an intuitive level. The actual goal of a proof is to create new knowledge (via the proof) which can be distributed across the mathematical community. If proofs exist but are not easily comprehensible, then they don’t accomplish this goal.
If a proof is understandable only by a few people in the world, I would say it fails as a proof, according to your definition. There are many such proofs out there right now. It doesn't help that many of them are wrong, albeit usually in a fixable way.
Making a proof formal doesn't mean it is not understandable any more. First, certainly a machine can "understand" it now. I think with AI improving what exactly such an understanding is worth, and what you can do with it, will increase a lot.
Secondly, the inscrutable tactics-based proofs featured in Lean (I have written such proofs in Isabelle in 1996) are actually quite old-fashioned for such a "modern" prover. Isabelle has long featured human-friendly syntax, and these proofs are certainly as understandable as English text, if written with care.
What we will soon get are proof assistants which allow you to write your proofs (and definitions) in English, but which are still fully formal and checkable.
That is an immense help if you are producing new knowledge, because usually nobody else looks at your work with sufficient scrutiny in the first place. If you understand it, and in addition the machine "understands" it, I think that will be the gold standard for proofs very soon.
It will also help mathematicians to clean up their act ;-)
Luckily, that is not necessary. You can make many mistakes, until you arrive at a logic you are happy with. Then you talk with other humans about it, and eventually you will all agree, that to the best of your knowledge, there is no mistake in the logic. If you pick first-order logic, that has already been done for you.
Then you need to implement that logic in software, and again, you can and will make mistakes here. You will use the first version of that software, or another logic software, to verify that your informal thoughts about why your logic implementation is correct can be formalised and checked. You will find mistakes, and fix them, and check that your correctness proof still goes through. It is very unlikely that it won't, but if it doesn't, you fix your correctness proof. If you can indeed fix it, you are done, no mistakes remain. If you cannot, something must be wrong with your implementation, so rinse and repeat.
At the end of this, you have a logic, and a logic implementation, which doesn't contain any mistakes. Guaranteed.
I would rather say that I actually have a definition of what a proof is, and you don't. But feel free to prove me wrong (pun intended), and tell me yours.
Yes, it was still proven. If I don't speak English but come up with a proof, I can't communicate the proof, but I have still proven it (i.e. created a proof).
I think that is unlikely to be the case - the classic example of a proof that humans "can't understand" is the Four Colour Theorem, but that's because the proof is a reduction to like 100000 special cases which are checked by computer.
To what extent is the proof of Fermat's Last Theorem "incomprehensible to humans" because only like a dozen people on the planet could truly understand it - I don't know.
The point of new proofs is really to learn new things about mathematics, and I'm sure we would learn something from a proof of Goldbach's conjecture.
Finally if it's not peer reviewed then its not a real proof eh.
I think the 4-color theorem is rather different though. It reduced to a large number of cases that can in principle each be verified by a human, if they were so inclined (indeed a few intrepid mathematicians have done so over the years, at least partially.) The point of using a computer was to reduce drudgery, not to prove highly non-obvious things.
Thinking back to Wiles' proof of FLT, it took the community several years of intense work just to verify/converge on the correct result. And that proof is ~130 pages.
So, what if the computer produced a provably correct, 4000-page proof of the Goldbach conjecture?
> This principle is broadly extensible to work and art
Nah, as a consumer it makes no difference to me if a meat packing factory or Amazon warehouse employs 5000 or 5 people. To art, this principle is totally real, but for work, it only applies to some/most of it.
Read the next sentence: "We only care about machines insofar as it serves us."
Imagine a machine doing "work" that only serves itself and other machines that does no service to humanity. It would have no economic value. In fact the whole concept of "work" only makes sense if it is assigned economic value by humans.
Then we agree, but your chess example made it sound like if a machine could automatically pack meat with little human intervention, people wouldn't want it. That's also not the next sentence. How is it broadly applicable to work?
There are people that believe that mathematics is actually useful, in ways that chess or art are not. I know, most mathematicians don't think so. But let us just entertain this crazy thought for a moment. Then a proof is just a tool that tells us, oh, we have applied this piece of mathematics right. No understanding of the proof is actually required for that, and no one cares if some mathematician somewhere actually fully understands the proof. It will be OK, even expected, that the machine is better than us at finding and checking proofs.
I don't think this principle extends to math proofs. It's much, much easier to verify a proof than to create it, and a second proof will just be a footnote. Not many mathematicians will want to work on that. That said, there is a lot of distance between IMO and the frontiers of research math.
Actually, I was looking up Elo ratings of the top computer chess players, and learned that it is not that trivial to compare these, due to differences in hardware requirements and whatnot.
Eh, people definitely care. AI has completely changed chess. Non viable lines have been proven viable and vice versa. All the pros study and develop new lines with AI.
Carlsen plays still at the highest level. He just didn't want to do the world championship anymore, wasn't worth the effort after winning it so many times.
Theorem proving is a single-player game with an insanely big search space; I always thought it would be solved long before AGI.
IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took the daunting task of formalizing the entirety of mathematics to themselves.
This lack of formalizing in math papers was what killed any attempt at automation, because AI researcher had to wrestle with the human element of figuring out the author's own notations, implicit knowledge, skipped proof steps...
> Theorem proving is a single-player game with an insanely big search space, I always thouht it would be solved long before AGI.
This seems so weird to me - AGI is undefined as a term imo, but why would you expect "producing something generally intelligent" (i.e. median human-level intelligence) to be significantly harder than "this thing is better than Terence Tao at maths"?
My intuition tells me we humans are generally very bad at math. Proving a theorem, in an ideal way, mostly involves going from point A to point B in the space of all proofs, using previous results as stepping stones. This isn't particularly a "hard" problem for computers which are able to navigate search spaces for various games much more efficiently than us (chess, go...).
On the other hand, navigating the real world mostly consists in employing a ton of heuristics we are still kind of clueless about.
At the end of the day, we won't know before we get there, but I think my reasons are compelling enough to think what I think.
I don't think that computers have an advantage because they can navigate search spaces efficiently. The search space for difficult theorems is gigantic. Proving them often relies on a combination of experience with rigorous mathematics and very good intuition [1] as well as many, many steps. One example is the classification of all finite simple groups [2], which took about 200 years and a lot of small steps. I guess maybe brute forcing for 200 years with the technology available today might work. But I'm sceptical and sort of hope that I won't be out of a job in 10 years. I'm certainly curious about the current development.
Oh for sure, when I say "soon" it's only relative to AGI.
What I meant to convey, is that theorem proving at least is a well-defined problem, and computers have had some successes in similar-ish search problems before.
Also I don't think pure brute-force was ever used to solve any kind of interesting problem.
Chess engines make use of alpha-beta pruning plus some empirical heuristics people came up with over the years. Go engines use Monte Carlo Tree Search with deep learning models for node evaluation. Theorem proving, when it is solved, will certainly use some kind of neural network.
Replace "this thing is better than Terrence Tao at maths" with "this thing is better than Garry Kasparov at chess" and your statement would sound equally reasonable in the early 90s.
Because being good/competent at EVERYTHING is harder than being great at any one thing for a computer, since generality requires fluid intelligence but a single specialist skill can be achieved through brute force search, as we've seen over and over again from DeepBlue 30 years ago to the present time.
In the meantime, while DeepBlue beat the world chess champion, Kasparov, at chess, our best efforts at generalism - LLMs that many (not me!) think are the path to AGI - struggle to play tic tac toe.
Not to join the hate train, but it probably isn't better than Terence Tao at solving the problems either, since Tao would probably have been able to solve the problems himself and would probably have taken less time. There were participants in this contest who were better than the AI.
Seems fair - I strongly think that "certain formalized problems" is very very far away from doing actual maths! And that actual maths is sometimes much less about solving known conjectures and much more about developing new theories.
They didn't formalize the entirety of math. Good thing the IMO doesn't need the entirety. But they didn't even formalize enough for the IMO--this is probably why the combinatorics problem wasn't solved.
Tangentially: I found it fascinating to follow along with the solution to Problem 6: https://youtu.be/7h3gJfWnDoc (aquaesulian is a nod to the ancient name of Bath). There's no advanced math and each step is quite simple, I'd guess at about an 8th-grade level.
Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.
I don't know what Lean is or how to see the AI's proofs, but an AI system that can explain such a question on par with the YouTuber above would be fantastic!
Once Gemini, the LLM, integrates with AlphaProof and AlphaGeometry 2, it might be able to reliably perform logical reasoning. If that's the case, software development might be revolutionized.
"... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind.
https://x.com/demishassabis/status/1816499055880437909
> First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.
I haven't read TFA as I'm at work, but I would be very interested to know what the system was doing in those three days. Were there failed branches it explored? Was it just fumbling its way around until it guessed correctly? What did the feedback loop look like?
I can't find a link to an actual paper, that just seems to be a blog post. But from what I gather the problems were manually translated to Lean 4, and then the program is doing some kind of tree search. I'm assuming they are leveraging the proof checker to provide feedback to the model.
Yeah, as a former research mathematician, I think “fumbling around blindly” is not an entirely unfair description of the research process.
I believe even Wiles in a documentary described his search for the proof of Fermat’s last theorem as groping around in a pitch black room, but once the proof was discovered it was like someone turned the lights on.
They just write "it's like alpha zero". So presumably they used a version of MCTS where each terminal node is scored by Lean as either correct or incorrect.
Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
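In rough pseudocode, that kind of loop might look like the sketch below (Python; `policy_net`, `value_net`, the `state` API, and the idea of scoring finished attempts with Lean are all hypothetical stand-ins, not anything DeepMind has published):

    import math

    # Toy single-player MCTS skeleton in the AlphaZero style. `policy_net`,
    # `value_net` and the `state` API (apply / is_terminal / terminal_score)
    # are hypothetical stand-ins; in a prover, terminal_score would be the
    # Lean checker's verdict on a finished proof attempt.
    class Node:
        def __init__(self, prior):
            self.prior, self.visits, self.value_sum, self.children = prior, 0, 0.0, {}

        def value(self):
            return self.value_sum / self.visits if self.visits else 0.0

    def select_child(node, c_puct=1.5):
        # PUCT: prefer children with high value, high prior, and few visits.
        def score(item):
            _, child = item
            explore = c_puct * child.prior * math.sqrt(node.visits + 1) / (1 + child.visits)
            return child.value() + explore
        return max(node.children.items(), key=score)

    def mcts(root_state, policy_net, value_net, simulations=400):
        root = Node(prior=1.0)
        for _ in range(simulations):
            node, state, path = root, root_state, [root]
            while node.children:                      # 1. walk down the tree
                move, node = select_child(node)
                state = state.apply(move)
                path.append(node)
            if state.is_terminal():                   # e.g. Lean accepted or rejected the attempt
                value = state.terminal_score()
            else:                                     # 2. expand and evaluate with the nets
                for move, prior in policy_net(state):
                    node.children[move] = Node(prior)
                value = value_net(state)
            for visited in path:                      # 3. back up (single-player: no sign flip)
                visited.visits += 1
                visited.value_sum += value
        # Act on the most-visited move at the root.
        return max(root.children.items(), key=lambda item: item[1].visits)[0]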
The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
So they had three days to keep training the model, on synthetic variations of each IMO problem.
Don't confuse interpolation with extrapolation. Curing cancer will require new ideas. IMO requires skill proficiency in tasks where the methods of solving are known.
The methods are known, but the solutions to the IMO problems weren't. So the AI did extrapolate a solution.
Also, there's no reason to affirm that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient, it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)
I think you are correct though. We don't need new physics to cure cancer. But we may need information-handling, reasoning and simulation systems which are orders of magnitude bigger and more complex than anything we have this year. We also need to stop pussy-footing and diddling with ideologies and start working on the root cause of cancer and almost every other disease, which is aging.
I think this is kinda false actually on the cancer side. We have reached a point where we have known approaches that work. It's "just" a matter of putting them into practice which will of course require solving many little details, which is very important and time-consuming work, but it doesn't require super-human genius level of lateral thinking, just a few millions man years of grinding away at it.
new doesn't necessarily mean "an extremal point that's not the average of two existing points". The set of existing knowledge is not necessarily continuous; the midpoint between two known points may be unknown, and thus would be a "new" point that could be obtained by interpolation.
Search is extrapolation. Learning is interpolation. Search+Learn is the formula used by AZ. Don't forget AZ taught us humans a thing or two about a game we had 2000 years head start in, and starting from scratch not from human supervision.
no, search is not extrapolation. Extrapolation means taking some data and projecting out beyond the limits of that data. For example, if my bank account had $10 today and $20 tomorrow, then I can extrapolate and say it might have $30 the day after tomorrow. Interpolation means taking some data and inferring the gaps of that data. For example, if I had $10 today and $30 the day after tomorrow, I can interpolate and say I probably had $20 tomorrow.
Search is different from either of those things, it's when you have a target and a collection of other things, and are trying to find the target in that collection.
Search can go from a random init model to beating humans at Go. That is not interpolation.
- Search allows exploration of the game tree, potentially finding novel strategies.
- Learning compresses the insights gained from search into a more efficient policy.
- This compressed policy then guides future searches more effectively.
Evolution is also a form of search, and it is open-ended. AlphaProof solved IMO problems, which are chosen to be out of distribution; simple imitation can't solve them. Scientists do (re)search; they find novel insights nobody else discovered before. What I want to say is that search is on a whole different level than what neural nets do: they can only interpolate their training data, while search pushes outside the known data distribution.
It's actually the combo of search+learning that is necessary; learning is just the little brother of search, compressing novel insights into the model. You can also think of training a neural net as a search for the parameters that best fit the training set.
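A toy illustration of that last point (purely illustrative, nothing to do with how AlphaProof actually trains): random local search over two parameters finds the line that fits a tiny dataset, no gradients required.

    import random

    # "Training as search": randomly propose nearby parameters and keep the
    # ones that fit better. Purely illustrative toy, not a real training loop.
    data = [(x, 3 * x + 1) for x in range(10)]          # ground truth: y = 3x + 1

    def loss(w, b):
        return sum((w * x + b - y) ** 2 for x, y in data)

    best_w, best_b = 0.0, 0.0
    for _ in range(20000):
        w = best_w + random.uniform(-0.1, 0.1)          # propose a nearby candidate
        b = best_b + random.uniform(-0.1, 0.1)
        if loss(w, b) < loss(best_w, best_b):           # keep it only if it fits better
            best_w, best_b = w, b

    print(round(best_w, 2), round(best_b, 2))           # ends up close to 3 and 1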
The problem solved "within minutes" is also interesting. I'd interpret that as somewhere between 2 and 59 minutes. Given the vagueness probably on the higher end, otherwise they'd celebrate it more. The students had 6 tasks in 9 hours, so on average 1.5h per task. If you add the time a student would take to (correctly!) translate the problems to their input format, their best-case runtime is probably about as fast as a silver-medalist would take to solve the problem on their own.
But even if they aren't as fast as humans yet this is very valuable. Both as a stepping stone, and because at a certain scale compute is much easier to scale than skilled mathematicians.
They say "our systems" (presumably meaning AlphaProof and AlphaGeometry 2) solved one problem "within minutes", and later on the page they say that the geometry question (#4) was solved by AlphaGeometry in 19 seconds.
So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.
I'd guess it's the first of those.
(Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)
As for the long solve times, I would guess they're related to this fascinating remark:
> The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
There are known algorithms that can solve _all_ problems in euclidean (ruler-and-compasses) geometry, no intuition required. The most effective algorithms of this type are quite inefficient, though, and (at least according to DeepMind) don't do as well as AlphaGeometry does at e.g. IMO geometry problems.
And say they did use 10% of all GCP? Would it be less impressive? This is a result that was considered by experts to be far beyond the state of the art; it's absolutely ok if it's not very efficient yet.
Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).
The writing was on the wall for the last year and a half (in fact I lost a bet to an IMO medalist about AI getting IMO gold by 8/2023) but three years ago this was unimaginable.
The thing is though, once we have a benchmark that we pass, it's pretty typical to be able to bring down the time required in short order through performance improvements and iterating on ideas. So if you knew you had AGI but it took 100% of all GCP for 3 years to give a result, within the next 5 years that would come down significantly (not least because you'd build HW dedicated to accelerating the slow parts).
That's patently false for many classes of problems. We know exactly how to solve the traveling salesman problem, and have for decades, but we're nowhere close to solving a random 1000 city case (note: there are approximate methods that can find good, but not optimal, results on millions of cities). Edit: I should say 1,000,000 city problem, as there are some solutions for 30-60k cities from the 2000s.
And there are good reasons to believe that theorem finding and proof generation are at least NP-hard problems.
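To make the combinatorial explosion concrete, a brute-force TSP really is only a few lines; it just stops being feasible almost immediately (toy sketch, random cities):

    import itertools, math, random

    # Brute-force TSP: try every ordering of the cities. Fine at 9 cities,
    # hopeless at 1,000, since the number of orderings grows factorially.
    random.seed(0)
    cities = [(random.random(), random.random()) for _ in range(9)]

    def tour_length(order):
        return sum(math.dist(cities[a], cities[b])
                   for a, b in zip(order, order[1:] + order[:1]))

    best = min(itertools.permutations(range(len(cities))), key=tour_length)
    print(round(tour_length(best), 3))
    print(len(str(math.factorial(1000))), "digits in 1000!, the ordering count for 1,000 cities")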
We're not talking about mathematical optimality here, both from the solution found and for the time taken. The point is whether this finds results more cheaply than a human can and right now it's better on some problems while others it's worse. Clearly if a human can do it, there is a way to solve it in a cheaper amount of time and it would be flawed reasoning to think that improving the amount of time would be asymptotically optimal already.
While I agree that not all problems show this kind of acceleration in performance, that's typically only true if you've already spent so much time trying to solve it that you've asymptoted to the optimal solution. Right now we're nowhere near the asymptote for AI improvements. Additionally, there are so many research dollars flowing into AI precisely because the potential upside here is nowhere near realized and there's lots of research lines still left to be explored. Geoffrey Hinton ended the AI winter.
Well if it takes 10% of all of Google’s servers 3 days to solve, you may find it difficult to scale out to solving 1000 problems in 3 days as well.
As for humans, 100 countries send 6 students to solve these problems. It also doesn't mean that these problems aren't solvable by anyone else. These are just the "best 6" where best = can solve and solve most quickly. Given a three day budget, 1000 problems could reasonably be solvable and you know exactly who to tap to try to solve them. Also, while the IMO is difficult and winners tend to win other awards like Fields Medals, there are many professional mathematicians who never even bother because that type of competition isn't interesting to them. It's not unreasonable to expect that professional mathematicians are able to solve these problems as well if they wanted to spend 3 days on it.
But in terms of energy per solve, humans are definitely cheaper. As you note, the harder part is scaling it out, but so far the AI isn't solving problems that are impossible for humans; it just managed, given enough time, to perform the same task. That's a very promising result, but supremacy is still a ways off for now (this AI can't win the competition yet).
The person said typically, not always the case. Just because there are obviously cases where it didn't happen doesn't mean it's not typically the case.
Still waiting for the first one. I'm not holding my breath - just like fuzzing found a lot of vulnerabilities in low-level software, I expect novel automated analysis approaches will yield some vulnerabilities - but that won't be a catastrophic event just like fuzzing wasn't.
Why don't you think that AI models will, perhaps rather soon, surpass human capabilities in finding security vulnerabilities? Because an AI that's even equally competent would be a fairly catastrophic event.
It's rumored that the NSA has 600 mathematicians working for them. If they are the ones finding the exploits you will probably never hear about them until they are independently discovered by someone who can publish.
It feels pretty disingenuous to claim silver-medal status when your machine played by significantly different rules. The article is light on details, but it says they wired it up to a theorem prover, presumably with feedback sent back to the AI model for re-evaluation.
How many cycles of guess-and-check did it take over the course of three days to get the right answer?
If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep) would AlphaProof still have gotten silver?
> let's be real I'd be okay waiting a month for the cure to cancer.
I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.
First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.
Secondly, this is not some hacky trick where brute force and some theorem prover magic are massaged to solve a select few problems and then you'll never hear about it again. They are building a general pipeline which turns informal natural language mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.
> First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.
If this were the case then the headline would be "AI solves 4/6 IMO 2024 problems", it wouldn't be claiming "silver-medal standard". Medals are generally awarded by comparison to other contestants, not to the challenges overcome.
> This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.
This is great, and I'm not complaining about what the team is working on, I'm complaining about how it's being sold. Headlines like these from lab press releases will feed the AI hype in counterproductive ways. The NYT literally has a headline right now: "Move Over Mathematicians, Here Comes AlphaProof".
At the IMO, "silver medal" afaik is defined as some range of points, which more or less equals some range of problems solved. For me it is fair to say that "silver-medal performance" is IMO language for about 4/6 problems solved. And what's the problem if some clickbait websites totally spin the result? They would've done it anyways even with a different title, and I also don't see the harm. Let people be wrong.
And we don't have 100% accuracy in translating ambiguous texts, because systems often need some domain knowledge, context, etc. And math has 0% tolerance for mistakes.
I also expect that math formalized by machine will be readable by machine and hardly understandable by humans.
I'm not sure it matters that it had access to a theorem prover. The fact that it's possible to build a black box that solves hard proofs on its own at all is the fascinating bit.
> it still took 8x as long to solve the problems as our best humans did without any computer assistance.
Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".
I am so exhausted of the AI hype nonsense. LLMs are not fucking curing cancer. Not now, not in five years, not in a hundred years. That's not what they do.
LLM/ML is fascinating tech that has a lot of legitimate applications, but it is not fucking intelligent, artificial or otherwise, and I am sick to death of people treating it like it is.
It seems unlikely people will employ only ML models, especially LLMs, to achieve great results: they will combine them with human insights (through direction and concrete algorithms).
It's obvious that's happening with LLMs even today to ensure they don't spew out too much bullshit or harmful content. So let's get to a point where we can trust AI as-is first, and let's talk about what's needed to achieve the next milestone after and if we get there.
And I love asking every new iteration of ChatGPT/Gemini something along the lines of "What day was yesterday if yesterday was a Thursday?" It just makes me giggle.
Thanks for that "what day was yesterday" prompt. I have run across these situations before, but never quite like that.
What is great about that Thursday prompt is how naked the LLM is to the reality that it knows absolutely nothing in the way we think of "to know". The bubble we are in is just awesome to behold.
A significant part of the problem is that a majority of people are unaware of just how simple what they consider "intelligence" really is. You don't need actual intelligence to replace the public-facing social role of a politician, or a talking head, or a reactive-only middle manager. You just need words strung together that fit a problem.
The kicker with some of those math competition problems is that they reduce to finding all natural numbers for which some statement is true. These are almost always small numbers, less than 100 in most circumstances.
Which means these problems are trivial to solve if you have a computer - you can simply check all possibilities. And that is precisely the reason why calculators aren't allowed.
But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved - roughly 30 minutes per problem. You are not supposed to use brute force, but recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer is not enough for full points. You would have to give adequate justification.
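To make the contrast concrete, here is a toy sketch in that mold (a made-up mini-problem, not an actual IMO statement): where does Euler's prime-generating polynomial n^2 + n + 41 first fail below 100? By hand you need the insight that n = 40 gives 41^2; by machine you just enumerate.

    # Made-up mini-problem in the "find all small n with property P" mold:
    # for which n < 100 does n^2 + n + 41 fail to be prime? A contestant must
    # argue this by hand; a computer simply checks everything.
    def is_prime(m):
        return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

    print([n for n in range(100) if not is_prime(n * n + n + 41)])
    # the smallest exception is n = 40, where the value is 41^2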
Some more context is provided by Tim Gowers on Twitter [1].
Since I think you need an account to read threads now, here's a transcript:
Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.
It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.
However, that statement needs a bit of qualifying.
The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.
If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.
Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.
Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.
As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.
However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.
So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.
That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.
Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.
It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.
But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.
The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".
However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.
> If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.
Or if AlphaProof used more compute, it could have slashed that time to 1/10 or less. It's arbitrary as long as we don't define how much compute the AI should be entitled to use here.
This is quite cool! I've found logical reasoning to be one of the biggest weak points of LLMs, nice to see that an alternative approach works better!
I've tried to enlist GPT to help me play an Android game called 4=10, where you solve simple math problems, and it was hilariously terrible at it. It would both break the rules I described and make math mistakes, such as claiming 6*5-5+8=10.
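(For what it's worth, the game itself falls to a dumb search instantly; a throwaway sketch, assuming the usual 4=10 rules of combining the four given digits with + - * / and parentheses to reach 10:)

    from itertools import permutations, product

    # Brute-force solver for the "4 = 10" puzzle mentioned above: try every
    # digit order, operator choice, and parenthesization.
    def solve(digits, target=10):
        hits = set()
        for a, b, c, d in permutations(digits):
            for o1, o2, o3 in product("+-*/", repeat=3):
                for e in (f"(({a}{o1}{b}){o2}{c}){o3}{d}",   # the five parenthesizations
                          f"({a}{o1}({b}{o2}{c})){o3}{d}",
                          f"({a}{o1}{b}){o2}({c}{o3}{d})",
                          f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                          f"{a}{o1}({b}{o2}({c}{o3}{d}))"):
                    try:
                        if abs(eval(e) - target) < 1e-9:
                            hits.add(e)
                    except ZeroDivisionError:
                        pass
        return hits

    print(solve((6, 5, 5, 8)))   # includes "(6*5)/(8-5)", i.e. 30 / 3 = 10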
I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.
Also, perhaps an LLM could do the translation step that is currently manual?
Formalization is in principle just a translation process and should be a much simpler problem than the actual IMO problem. Besides, they also trained a Gemini model which formalizes natural language problems, and this is how they generated training data for AlphaProof. I would therefore expect that they could have also formalized the IMO problems with that model and just did it manually because the point is not to demonstrate formalizing but instead proof capabilities.
Yet the facts at hand are the opposite of what you say. A reliable formalizer was the more difficult problem, harder than solving the formalized IMO problems, because they have not produced one.
That does not necessarily follow from the facts at hand. For example they may have prioritized work on the proof solver itself as they may feel that that is the more important result. Alternatively if their goal is to build a proof solver then building the formalizer would be useless if they could not build the actual proof solver.
A proof solver existed. They were improving the proof solver explicitly by making the formalizer a part of the training. Formalizer reliability is the key novelty. It turns out it was only reliable enough for training. So unless they made the problem statement at the outset that "we'll only make the formalizer strong enough to train but not use", I disagree with that assessment.
That's a good point. The formalizer was used to create the training data for the proof solver, so they likely worked on it more than if they had just used it as a preprocessing step during inference. It is still possible that they worked on the formalizer only until they got results good enough to create good training data, and then began training as soon as possible and did not spend too much time trying to improve the formalizer further. Depending on how long the training was expected to take, perhaps that is a reasonable assumption. Although I think I agree more with your view now.
> Formalization is in principle just a translation process and should be a much simpler problem than the actual IMO problem
Maybe not, because you need to connect many complicated topics/terms/definitions together, and you don't have a way to reliably verify whether a formalized statement is correct.
They built an automatic formalization network in this case, but didn't trust it and formalized the problems manually.
Yes and it is difficult for me to believe that there is not useful human analysis and understanding involved in this translation that the AI is helpless without. But that I suppose is a problem that could be tackled with a different model...
I find it incredibly impressive you did this in 15 minutes! You should really help out in formalizing math (completely serious).
Personally in the past I tried a few times to formalize some statements and sometimes I found that the mathlib libraries were pretty lacking in these more open-ended problems (I wanted to reason about lists and stuff). But it seems that I am just very bad at formalization lol.
But formalization is the easy part for humans. I'm sure every mathematician would be happy if the only thing required to prove a result was to formalize it in Lean and feed it to the AI to find the proof.
Not sure every mathematician would be happy to do this... it sounds much less pleasant than thinking. It's like saying mathematicians would rather be programmers lol. It's a significantly difficult problem which I believe should be left completely to AI. Human formalization should eventually become a dead art.
IIUC, a Gemini-based system could translate the natural language questions into Lean, but in the blog post they don’t really commit to whether this was done just to generate training data or was used in the competition.
Reading into the details, the system is more impressive than the title. 100% of the algebra and geometry problems were solved. The remaining problems are of combinatorial types, which ironically more closely resemble software engineering work.
AI is simply another form of what we've been doing since the dawn of computers: expressing real world problems in the form of computations.
While there are certainly some huge jumps in compute power, theory of data transformation and availability of data to transform, it would surprise me if computers in a 100 years do not still rely on a combination of well-defined and well-understood algorithms and AI-inspired tools that do the same thing but on a much bigger scale.
If not for any other reason, then because there are so many things where you can easily produce a great, always correct result simply by doing very precise, obvious and simple computation.
We've had computers and digital devices for a long while now, yet we still rely heavily on mechanical contraptions. Sure, we improve them with computers (eg. think brushless motors), but I don't think anyone today would be surprised that these same devices (hair dryers, lawn mowers, internal combustion engines...) were designed before computers.
Also it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.
Everything that allows for cheap validation is going that way. Math, code, or things we can simulate precisely. LLM ideation + Validation is a powerful combination.
I'm still unclear whether the system used here is actually reasoning through the process of solving the problem, or brute-forcing candidate solutions, with reasoning only coming in during the attempted proof of each candidate.
Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?
This is kind of an ideal use-case for AI, because we can say with absolute certainty whether their solution is correct, completely eliminating the problem of hallucination.
It would be nice if on the page they included detailed descriptions of the proofs it came up with, more information about the capabilities of the system and insights into the training process...
If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.
I found those, I just would have appreciated if the content of the mathematics wasn't sidelined to a separate download as if it's not important.
I felt the explanation on the page was shallow, as if they just want people to accept it's a black box.
All I've learnt from this is that they used an unstated amount of computational resources just to basically brute force what a human already is capable of doing in far less time.
Very few humans go after this type of the training. In my "math talent" school (most of the Serbian/Yugoslavian medal winners came from it), at most a dozen students "trained" for this over 4 high school generations (500 students).
Problems are certainly not trivial, but humans are not really putting all their effort into it either, and the few that do train for it, on average medal 50% of the time and get a silver or better 25% of the time (by design) with much less time available to do the problems.
This is disingenuous. People who train are already self-selected people who are talented in math. And among the people who train, not everyone gets to this level. Sadly, I speak from personal experience.
This school is full of people talented at math — you can't get in if you don't pass a special math exam (looking at the list, out of Serbia's 16 gold medals, I can see 14 went to students of this school, and numerous silver and bronzes too — Serbia participates as an independent country since 2006 with a population of roughly 7M, if you want to compare it with other countries on the IMO medal table). So in general, out of this small pool (10 talented and motivated people out of 4 generations), Serbia could get a gold medal winner on average almost once every year. I am sure there were other equally talented mathematicians among the 490 students that did not train for the competition (and some have achieved more academic success later on).
Most students were simply not interested. And certainly, not everybody is equally talented, but the motivation to achieve competition success is needed too — perhaps you had the latter but not enough of the former. I also believe competitive maths is entirely different from research maths (time pressure, even luck is involved for a good idea to come up quickly, etc). Since you said you were a potential bronze medal winner, it might not even be a talent issue but maybe you just had great competition and someone had the better luck in one or two tests to rank above you (better luck as in the right idea/approach came to them quicker, or the type of problem that appeared on the test suited them more). And if you are from a large country like USA, China or Russia (topping the medal table), it's going to be freaking hard to get into a team since you'll have so many worthy students (and the fact they are not always scoring only golds for their teams out of such large pools tells me that the performance is not deterministic).
As a mathematician, I am sure you'd agree you'd want to run a lot more than a dozen tests to establish statistical significance for any ranking between two people at competitive maths IMO style, esp if they are close in the first few. As an anecdote, many at my school participated in national level maths and informatics competitions (they start at school level, go on to county/city level to nation level) — other than the few "trained" competitors staying at the top, the rest of the group mostly rotated in the other spots below them regardless of the level (school/county/nation). We've actually joked amongst ourselves about who had the better intuition "this time around" for a problem or two, while still beating the rest of the country handily (we've obviously had better base level of education + decently high base talent), but not coming close to "competitors".
I, for instance, never enjoyed working through math problems and math competitions (after winning a couple of early age local ones): I've finished the equivalent of math + CS MSc while skipping classes by only learning theory (reading through axioms, theorems and proofs that seemed non-obvious) and using that to solve problems in exams. I've mostly enjoyed building things with the acquired knowledge (including my own proofs on the spot, but mostly programming), even though I understood that you build up speed with more practice (I was also lazy :)).
So, let's not trivialize solving IMO-style problems, but let's not put them on a pedestal either. Out of a very small pool of people who train for it, many score higher than AI did here, and they don't predict future theoretical math performance either. Competition performance mostly predicts competition performance, but even that with large error bars.
To mathematicians the problems are basically easy (at least after a few weeks of extra training) and after having seen all the other AI advances lately I don't think it's surprising that with huge amounts of computing resources one can 'search' for a solution.
Sorry, that's wrong. I have a math PhD and I trained for Olympiads in high school. These problems are not easy for me at all. Maybe they are for top mathematicians who used to compete.
We need to up the ante: getting human-like performance on any task is not impressive in itself; what matters is superhuman performance, orders of magnitude above. These comparisons with humans in order to create impressive-sounding titles are disguising the fact that we are still in the stone age of intelligence.
I'm curious if we'll see a world where computers could solve math problems so easily, that we'll be overwhelmed by all the results and stop caring. The role of humans might change to asking the computer interesting questions that we care about.
> As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise.
Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.
They explicitly stated that to achieve the current results, they had to manually translate the problem statements into formal mathematical statements:
> First, the problems were manually translated into formal mathematical language for our systems to understand.
The way I understand what they're saying is that they used Gemini to translate the problem statement into formal mathematical language and let DeepMath do its magic after that initial step.
Except it didn’t. The problem statements were hand-encoded into a formal language by human experts, and even then only one problem was actually solved within the time limit. So, claiming the work was “silver medal” quality is outright fraudulent.
I had exactly the same feeling when reading this blog. Sure, the techniques used to find the solutions are really interesting. But they claim more than they achieved. The problem statements are not available in Lean, and the time limit is 2 x 4.5 hours, not 3 days.
The article claims they have another model that can work without formal languages, and that it looks very promising. But they don't mention how well that model performed. Would that model also perform at silver medal level?
Also note that if the problems are provided in a formal language, you can always find the solution in a finite amount of time (provided a solution exists). You can brute-force over all possible solutions until you find the one that proves the statement. This may take a very long time, but it will find the solutions eventually. You would always solve all the problems and win the IMO at gold-medal level. AlphaProof seems to do something similar, but takes smarter decisions about which possible solutions to try and which ones to skip. What would be the reason they don't achieve gold?
It’s like bringing a rocket launcher to a fist fight but I’d like to use these math language models to find gaps in logic when people are making online arguments. It would be an excellent way to verify who has done their homework.
To what extent is the training and structure of AlphaProof tailored specifically to IMO-type problems, which typically have short solutions using combinations of a small handful of specific techniques?
(It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)
The last statement is largely correct (though idk about the IMO medalists being unable to solve actual problems; most mathematicians can't solve most open problems). But I kind of disagree with the assessment of IMO problems--the search space is huge; if it were as you say, it would be easy to search.
No, I don't mean that the search space is small. I just mean that there are special techniques which are highly relevant for IMO-type problems. It'd be interesting to know how important that knowledge was for the design and training of AlphaProof.
In other words, how does AlphaProof fare on mathematical problems which aren't in the IMO style? (As such exceptions comprise most mathematical problems)
So they weren't able to solve the combinatorics problem. I'm not super well versed in competition math, but combinatorics always seem to be the most interesting problems to me.
I mean, IMO algebra problems can require very clever insights as well, and number theory especially has some really nice proof arguments you can make. It's easier to make a bad problem of this category though because it's much easier to hide the difficulty in a bunch of computations / rote deduction, and not creative insights.
Combinatorics problems are usually simple enough that anyone can understand and try tackling it though, and the solutions in IMO are usually designed to be elegant. I don't think I've ever seen a bad combo problem before.
I honestly expected the IOI (International Olympiad in Informatics) to be "beaten" much earlier than the IMO. There's AlphaCode, of course, but as of the latest update I don't think it was quite at "silver medal" level. And available LLMs are probably not even at "honourable mention" level.
I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).
Given how much of a dent LLMs are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.
IOI problems are closer to IMO combinatorics problems than to the other IMO problem types. That might be the reason for the delay. I personally liked only the combinatorics problems in the IMO. That's why I dropped the math track and went to the IOI instead.
I feel the reason combinatorics is harder for AI models is the same reason LLMs are not great at reasoning about anything out of distribution. LLMs are good pattern recognizers, and fascinating at this point. But simple tasks like counting intersections in Venn diagrams require more strategy and less pattern recognition. Pure NN-based models don't seem like they will be enough to solve these problems. AI agents and RL are promising.
I don't know anything about Lean, but I am curious whether proofs of combinatorial problems can be represented as well as those in number theory or algebra. If combinatorial solutions are always closer to natural language, the failure of LLMs is expected; or at least we can assume it might take more time to get better. I am making the assumption here that solutions to IMO combinatorics problems are more natural-language oriented and rely more on common sense/informal logic compared to geometry or number theory problems.
>because of limitations in reasoning skills and training data
One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?
Are all of these specialized models available for use? Like, does it have an API?
I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them
> it’s hard to imagine why more than a handful of researchers would use them
If you could automatically prove that your concurrency protocol is safe, or that your C program has no memory management mistakes, or that your algorithm always produces the same results as a simpler, more obviously correct but less optimized algorithm, I think that would be a huge benefit for many programmers.
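Today many of us approximate that last item with randomized differential testing, which only samples the input space; the draw of a proof-capable assistant is making the same check exhaustive. A throwaway sketch (the `fast_sort` here is just a hypothetical stand-in for the optimized code you actually care about):

    import random

    # Randomized differential testing: check an "optimized" routine against a
    # simple, obviously-correct reference on random inputs. This only samples
    # the input space; a proof would cover every input.
    def reference_sort(xs):
        # selection sort: slow but obviously correct
        ys, out = list(xs), []
        while ys:
            m = min(ys)
            ys.remove(m)
            out.append(m)
        return out

    def fast_sort(xs):
        # hypothetical stand-in for the optimized implementation
        return sorted(xs)

    for _ in range(10_000):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
        assert fast_sort(xs) == reference_sort(xs), xs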
Can anyone comment on how different the AI generated proofs are when compared to those of humans? Recent chess engines have had some 'different' ideas.
If the system took 3 days to solve a problem, how different is this approach from a brute-force attempt at the problem with educated guesses? That's not reasoning in my mind.
Because with AlphaGeometry it literally was just a feedback loop brute forcing over a known database of geometry axioms with an LLM to guide the guesses.
Here, from what I understand, it's instead a theorem prover + LLM backing it. General proofs have a much larger search space than the 2d geometry problems you see on IMO; many former competitors disparage geometry for that reason.
Why is it so hard to make an AI that can translate an informally specified math problem (and Geometry isn't even so informal) into a formal representation?
> AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. … Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness.
Edit: To defend my response, the model definitely knows when it hasn't yet found a correct response, but this is categorically different from knowing that it does not know (and of course monkeys and typewriters etc., can always find a proof eventually if one exists).
6 months ago I predicted Algebra would be next after geometry. Nice to see that was right. I thought number theory would come before combinatorics, but this seems to have solved one of those. Excited to dig into how it was done
Can someone explain why proving and math problem solving is not a far easier problem for computers? Why does it require any “artificial intelligence” at all?
For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.
What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?
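(For concreteness, here's roughly what that toy statement looks like in Lean 4, a sketch assuming mathlib's `Even`, which unfolds to `∃ r, n = r + r`; the proof is a few lines, which is exactly why I'm puzzled about what makes the competition problems so much harder:)

    import Mathlib

    -- The toy statement above: the sum of two even integers is even.
    example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
      obtain ⟨x, hx⟩ := ha   -- a = x + x
      obtain ⟨y, hy⟩ := hb   -- b = y + y
      exact ⟨x + y, by rw [hx, hy]; ring⟩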
In a sense, the model _is_ simply applying a finite and known set of axioms and manipulations. What makes this hard in practice is that the number of possible ways in which to perform multiple steps of this sort of axiomatic reasoning grows exponentially with the length of the shortest possible solution for a given problem. This is similar to the way in which the tree of possible futures in games like go/chess grows exponentially as one tries to plan further into the future.
This makes it natural address these problems using similar techniques, which is what this research team did. The "magic" in their solution is the use of neural nets to make good guesses about which branches of these massive search trees to explore, and make good guesses about how good any particular branch is even before they reach the end of the branch. These tricks let them (massively) reduce the effective branching factor and depth of the search trees required to produce solutions to math problems or win board games.
The problems in question require much, much more complex proofs. Try example IMO problems yourself and see if they don't require much intelligence: https://artofproblemsolving.com/wiki/index.php/IMO_Problems_.... And then keep in mind that research math is orders of magnitude more complex still.
What you're missing is that this kind of thing has arbitrarily been declared "artificial intelligence" territory. Once the ability of computers to do has been established, it will no longer be artificial intelligence territory; at that point it'll just be another algorithm.
Proofs require a certain ingenuity that computers just don't have, imo. A computer would never be able to come up with something like Cantor's diagonalization proof on its own.
Technically yes. And it's easy. You can probably do it with your PC's computational power.
The thing is that most math "problems" are not solved, not because they're hard, but because they're not interesting enough to even be discovered by humans.
This is a fun result for AI, but a very disingenuous way to market it.
IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].
This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.
If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?
Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.
Working mathematicians mostly don't use theorem provers in their work, and find that when they do they go significantly more slowly (with of course the compensating advantage of guaranteeing no mistakes in the final result).
A theorem prover is probably more useful for the typical IMO problem than for the typical real research problem, but even so I'd guess that even with a reasonable amount of training most IMO contestants would not do much better for having access to a theorem prover.
Having three days would be a bigger deal, for sure. (But from "computers can't do this" to "computers can do this, but it takes days" is generally a much bigger step than from "computers can do this, but it takes days" to "computers can do this in seconds".)
I can tell you, as someone who could have gotten bronze (I was too weak for the team) and is now a math PhD--I most likely would not have scored as well as AlphaProof in three days. In most problems either you find an idea soon or it can take much, much longer. It's just not a matter of steady work and constant progress.
Noting the difference in humility between someone who was at the level of competing for an IMO spot (6 people per country) and the multitude of hindsight prophets who trivialize the DeepMind achievement.
They're literally comparing AI to human IMO contestants. "AlphaProof solves 4/6 IMO problems correctly" would be the non-comparison version of this press release and would give a better sense for how it's actually doing.
"Solving IMO problems at Silver-Medal level" is pretty much equivalent to solving something like 4/6 problems. It is only a disingenuous comparison if you want to read it as a comparison. I mean yea, many people will, but I don't care about them. People who are technically interested in this know that the point is not to have a competition of AI with humans.
It's great that you feel safe being so aloof, but I believe we have a responsibility in tech to turn down the AI hype valve.
The NYT is currently running a piece with the headline "Move Over, Mathematicians, Here Comes AlphaProof". People see that, and people react, and we in tech are not helping matters by carelessly making false comparisons.
I think search-based AI is on a different level than imitation models like GPT. This is not a hallucinating model, and it could potentially discover things that are not written in any books.
Search is amazing. Protein folding searches for the lowest energy configuration. Evolution searches for ecological fit. Culture searches for progress, and science for understanding. Even placing a foot on the ground searches for dynamic equilibrium. Training an ML model searches for the best parameters to fit a dataset. Search is universal. Supervised learning is just imitative; search is extrapolative.
Why? Why is hype bad? What actual harm does it cause?
Also the headline is fair, as I do believe that AlphaProof demonstrates an approach to mathematics that will indeed invade mathematicians' workspaces. And I say that as a mathematician.
For sure. I feel like mathematicians are not paying attention (maybe deliberately): we have a real shot at solving some incredibly hard open problem in the next few years.
And why aren't you complaining that human participants could train and study for thousands of hours before attempting the problems? And that the training materials they used was itself created and perfected by hundreds of other people, after having themselves spend countless hours studying?
In 1997, machines defeated a World Chess Champion for the first time, using brute-force "dumb search." Critics noted that while "dumb search" worked for chess, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]
In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while this fancy form of "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]
In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" over cleverly pruned spaces isn't as dumb as the critics would like it to be?
The success in Go was very much not dumb search. Where dumb search had failed to achieve the level of even a very weak player, the neural net's "intuition" about good moves without any search was already very strong. Only the combination was superhuman, and that was anything but the dumb search that had been tried before.
Today's announcement is also not about proving Lean theorems by "dumb search". The success is about search + neural networks.
You're attacking critics for criticizing the solution that has failed, while confusing it for the solution that works to this day.
> My point is that you're attacking a position that no real critic holds
I'm not so sure. I wrote my comment after seeing quite a few comments on other threads here that read like fancy versions of "this is just brute-force search over cleverly pruned trees." Search the other threads here, and you'll see what I mean.
A brute force search has perfect knowledge. Calling it "dumb" encourages bad analogies -- it's "dumb" because it doesn't require advanced reasoning. It's also "genius" because it always gets the right answer eventually. It's hugely expensive to run.
And you keep shifting the goalpost on what's called "dumb" here.
By their nature, hard math problems do not have fast algorithmic solutions--you can only solve them by search, or clever search. Mathematicians have heuristics that help us decide intuitively what's going to work and what can make progress. But in the end--there is only search.
High efficiency "search" is necessary to reach AGI. For example, humans don't search millions of potential answers to beat ARC Prize puzzles. Instead, humans use our core experience to shrink the search space "intuitively" and deterministically check only a handful of ideas. I think deep-learning guided search is an incredibly promising research direction.
By "dumb" I assume you meant brute-force. Search, as opposed to extrapolation, is what actually produces surprise or creative results, whether it happens in a person's head or on a computer. The issue is to produce the heuristics that can let one push back the combinatorial explosion.
It kinda had to be this way, I think. There's a something from nothing problem. Douglas Adams brilliantly starts at this point.
We don't understand something from nothing. We don't even have the language to describe it. Concepts like "complexity" are frustratingly resistant to formalization.
"There is no free will." Has recently resurged as a philosophical position... like it did in response to Newton's mechanics.
Matter from void. Life from minerals. Consciousness from electrons. Free will from deterministic components. Smart reasoning & intelligent rationalisation from dumb search, computation, DNNs and such.
I don't think this debate was supposed to be ended by anything short of empirical demonstration.
Endnote: Deep Blue's victory over Garry was a bunch of preprogrammed bulls--t. Rematch!
If one of the lessons of LLMs was that much of "common sense" is covered in the written language corpus - that is, perhaps, basic human intelligence is covered by language.
Then, should we expect less with mathematics where written language is the normal way knowledge is propagated, and where formal proofs are wanted? An important distinction here is the coupling of search (not LLM for this one), a formal math language, and theorem proving. Math intelligence may not be merely the math written corpus, but adding the formal language and theorem proving sounds pretty powerful.
All this still lacks self-directed goals. An intention. For now that's taken care of by the human asking questions.
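FWIW, the "formal language + theorem proving" half is what turns this into a game with a referee: whatever candidate proof the model emits, the Lean kernel either accepts it or rejects it. A toy example of a machine-checkable statement (mine, not from the paper; real IMO formalizations are far hairier):

    -- The kernel either accepts this term as a proof of `a + b = b + a`
    -- or rejects it, which is the crisp signal a search loop needs.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b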
>that is, perhaps, basic human intelligence is covered by language.
It isn't. That's why LLMs are still terrible at basic reasoning tasks. The dataset doesn't contain the human intelligence, just the end results. Nobody is training their LLMs on brain scans.
> AI solves International Math Olympiad problems at silver medal level
> In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Why not compare with students who are given 3 days to submit an answer? /s
This is more analogous to programmers working with Copilot. There's an exciting possibility here of mathematicians feeding these systems subproblems to assist in proving larger theorems.
This means we may need to remove or replace the Olympiad. It has no practical significance, and winners have never contributed to any major scientific breakthroughs.
"A number of IMO participants have gone on to become notable mathematicians. The following IMO participants have either received a Fields Medal, an Abel Prize, a Wolf Prize or a Clay Research Award, awards which recognise groundbreaking research in mathematics; a European Mathematical Society Prize, an award which recognizes young researchers; or one of the American Mathematical Society's awards (a Blumenthal Award in Pure Mathematics, Bôcher Memorial Prize in Analysis, Cole Prize in Algebra, Cole Prize in Number Theory, Fulkerson Prize in Discrete Mathematics, Steele Prize in Mathematics, or Veblen Prize in Geometry and Topology) recognizing research in specific mathematical fields. Grigori Perelman proved the Poincaré conjecture (one of the seven Millennium Prize Problems), and Yuri Matiyasevich gave a negative solution of Hilbert's tenth problem."
[...]
"IMO medalists have also gone on to become notable computer scientists. The following IMO medalists have received a Nevanlinna Prize, a Knuth Prize, or a Gödel Prize; these awards recognise research in theoretical computer science."
A lot of them become Fields medallists. From [1] "The conditional probability that an IMO gold medalist will become a Fields medalist is fifty times larger than the corresponding probability for a PhD graduate from a top 10 mathematics program."
(And with this comment, the 2024 Olympics commences.)
There are so many competitions that don't have any obvious practical significance. And people are still enjoying competitions where AI completely pwns humans.
I see DeepMind is still playing around with RL + search algorithms, except now it looks like they're using an LLM to generate state candidates.
I don't really find this that impressive. With enough compute you could just do n-of-10,000 LLM generations to "brute force" a difficult problem, and you'll get there eventually.
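For what it's worth, even the "n-of-10,000" framing quietly assumes a cheap, perfect verifier to pick the winner, which is exactly what Lean provides. A rough sketch of sample-and-verify (the generate and lean_check functions are hypothetical stand-ins, not anything DeepMind has described):

    def best_of_n(problem, generate, lean_check, n=10_000):
        # Sample candidate proofs and return the first one the checker accepts.
        # This only works because lean_check gives a perfect yes/no signal;
        # without a verifier there's no way to spot the needle among 10,000 tries.
        for _ in range(n):
            candidate = generate(problem)
            if lean_check(problem, candidate):
                return candidate
        return None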
IMO and chess are the same in the most important respect: you can use Lean or a simulated chess game to create unlimited high-quality training labels. Any problem in this category should be solvable with enough compute and clever architecture/metacognition design. The more intractable problems are the ones where data is hard to find or synthesize.
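And to unpack "unlimited quality training labels": because the checker is automatic, every accepted attempt can be fed straight back into training. A hedged sketch of that loop (all names here are placeholders, not a published recipe):

    def self_training_round(problems, generate, lean_check, fine_tune):
        # The verifier acts as the labeler: keep only attempts it accepts.
        verified = []
        for problem in problems:
            attempt = generate(problem)
            if lean_check(problem, attempt):
                verified.append((problem, attempt))
        # Train on the verified pairs; the improved model seeds the next round.
        return fine_tune(verified)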
I really don't understand what you mean by this. 1) it's not known whether chess is a win for White or not. 2) IMO problems, such as 2024 problem 1 which the system solved, are often phrased as "Determine all X such that…".
You are attacking a straw man and the point made is pretty good.
Competition problems are designed to be actually solvable by contestants. In particular, they should be solvable with a reasonable collection of techniques, and many "prep courses" will teach you techniques, tools, and algorithms; a good starting point is to throw that stuff at any given problem. So, just like with chess openings, putting in lots of legwork will give you good results for that part. You might very well lose in the mid and late game, just as this AI might struggle with "actual problems".
It is of course still very impressive, but that is an important point.
I'm attacking nobody! I literally couldn't understand the point, so I said so: as stated, its premises are simply false!
Your point, however, is certainly a good one: IMO problems are an extremely narrow subset of the space of mathematical problems, which is itself not necessarily even 50% of the space of the work of a mathematician.
It isn't solved but the evaluation (which side is better, by how much, and which moves are best) of a strong engine is - for all practical purposes - an answer to every chess position you can pose to it. This makes it easy to gauge improvement and benchmark against other systems compared to some other problems.
Yes, sure, but this doesn't mean that this generalizes to open math research problems, which would be the useful real-world application of this. Otherwise this is just playing known games with known rules, granted better/faster than humans.
You realize this holds true for all of math, right? Outside of Gödel-incompleteness potholes, every proof/theorem is a permutation of ZFC. And you can fix the potholes by just filling them in with more Cs.
That seems shallow and not really useful. It's like saying sorting is easy: just take all permutations and look at each one to see whether it's sorted... it just might take you longer than the heat death of the universe to actually do that.
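To put numbers on the heat-death point (a toy illustration of the combinatorial explosion, nothing to do with the proof search itself):

    import itertools
    import math

    def permutation_sort(xs):
        # Correct but absurd: try every ordering until one happens to be sorted.
        for perm in itertools.permutations(xs):
            if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):
                return list(perm)

    # 25 items already means 25! ~ 1.55e25 orderings; at a billion checks per
    # second that's roughly half a billion years.
    print(math.factorial(25))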
I'm actually not that surprised. Maths Olympiads IME have always been 80% preparation, 20% skill - if not more heavily weighted toward preparation. It was all about solving as many problems as possible ahead of the papers, and having a good short-term memory. Since Olympiads are for kids, the amount of actual fundamental mathematics required is not that great.
Sounds perfect for a GPT model, with lots of input training data (problem books and solutions).
The proofs of these problems aren't interesting. They were already known before the AI started work.
What's interesting is how the AI found the proof. The only answer we have is "slurped data into a neural network, matched patterns, and did some brute search".
What were the ideas it brainstormed? What were the dead-end paths? What were the "activations" where the problem seemed similar to a certain piece of input, which led to a guess of a step in the solution?
The result is (likely) net energy consumption, and hence (likely) net CO2 emissions.
So, what did this achievement in AI cost us?
EDIT TO ADD: It's fair to think that such a presser should not include answers to my questions. But, it's also fair to want that level of transparency given we are dealing with climate change.
You're not wrong and in general the conclusion is that AI emits less CO2 than a human performing the same task. Whether that's true for this specific task is worth asking, as is the question of how efficient such a process can be made.
It's that we are living in a period of time where there are very real consequences of nearly a century of unchecked CO2 due to human industry.
And AI (like crypto before it) requires considerable energy consumption. Because of this, I believe we (people who believe in AI) need to hold companies accountable by very transparently disclosing those energy costs.
> need to hold companies accountable by very transparently disclosing those energy costs.
And if they do, then what? If it is "too high" do we delay research because we need to keep the world how it is for you? What about all the other problems others face that could be solved by doubling down on compute for AI research?
> And if they do, then what? If it is "too high" do we delay research because we need to keep the world how it is for you?
First, it's keeping the world how it is for all of us, not just me.
Second, to answer your question, I think that is a decision for all of us to weigh in on, but before we can do that, we must be informed as best we can.
Do sacrifices have to be made for the greater good? Absolutely. Do for-profit mega corporations get to make those decisions without consent from the public? No.
Many people don't want to live in the world how it is. They would rather see risks taken for accelerated progress. Stop trying to pretend your take is the humanitarian take.
I know this is not an uncommon opinion in tech circles, but I believe it's an insane thing to hang humanity's hopes on. There's no reason to think AI will be omnipotent.