GregorStocks's comments

I'm focused on Constructed for now. Eventually I'd like to try stuff like sideboarding, deck selection, deckbuilding, and drafting, but I wanna get the harness to the limit of models' abilities in Constructed first.


Respect. A good Elo is key to deciding which to hold in a draft. It’s very interesting because the mechanics of MTG make it so that there’s a whole “decision tree” on each player’s turn, across turns, and across the whole match.

“To cast or not to cast, that is the question” - MTG player with a non-blockable instant.


Oh, that's a good bug report - historically it was just hallucinating card effects so I made the harness throw the Oracle text for all visible cards into the context, but I bet I forgot to do that for the mulligan decision specifically (it's a weird one). Thanks!


A lot of models (including Opus) keep insisting in their reasoning traces that going first can be a bad idea for control decks, etc., which I find pretty interesting - my understanding is that the consensus among pros is closer to "you should go first 99.999% of the time", but the models seem to want there to be more nuance. Beyond that, most of the really interesting blunders that I've dug into have turned out to be problems with the tooling (either actual bugs, or MCP tools with affordances that are a poor fit for how LLMs assume they work). I'm hoping that I'm close to the end of those and am gonna start getting to the real limitations of the models soon.


To be clear, that's not estimated price, it's actual price I paid across all the real games. My hope is you'll see it trend down over time as I find more ways to make the harness token-efficient :)


That's even more interesting then! It would be cool if you added a price-to-performance column. Even if it's just for this one task, it's still interesting.


Performance is tricky to measure. Right now the best measure of performance I've got is the "blunder index", but that's currently flagging a lot of stuff that I really don't consider to be true blunders - I think my top priority for the next few evenings is going to be iterating on the blunder-annotator, and that'll help me identify what issues in the actual gameplay code to focus on. And the blunder index isn't really defined in such a way that you can do arithmetic on it meaningfully :)


You still need an algorithm to decide, for each game that you're simulating, what actual decisions get made. If that algorithm is dumb, then you might decide Mono-Red Burn is the best deck, not because it's the best deck but because the dumb algorithm can play Burn much better than it can play Storm, inflating Burn's win rate.

In principle, LLMs could have a much higher strategy ceiling than deterministic decision-tree-style AIs. But my experience with mage-bench is that LLMs are probably not good enough to outperform even very basic decision-tree AIs today.
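To illustrate, here's a toy Monte Carlo sketch (all numbers hypothetical, not from mage-bench) where the truly stronger deck loses the ranking purely because the play policy pilots it badly:

```python
import random

random.seed(0)

# Hypothetical inputs: how strong each deck is when piloted perfectly,
# and how well the (dumb) play policy can actually pilot it.
decks = {
    "Mono-Red Burn": {"true_strength": 0.52, "policy_skill": 0.95},
    "Storm": {"true_strength": 0.60, "policy_skill": 0.55},
}

def simulated_win_rate(deck, n_games=100_000):
    # Deliberate oversimplification: effective win probability is
    # true strength scaled by how well the policy plays the deck.
    p = deck["true_strength"] * deck["policy_skill"]
    wins = sum(random.random() < p for _ in range(n_games))
    return wins / n_games

rates = {name: simulated_win_rate(deck) for name, deck in decks.items()}
print(rates)
# Storm is the truly stronger deck (0.60 > 0.52), yet the simulation
# ranks Burn higher (~0.49 vs ~0.33) because the policy plays it better.
```

The point of the toy model is that the measured win rate confounds deck quality with pilot quality, which is exactly the bias described above.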


Um, obviously the Monte Carlo results would be used to generate utility-AI scoring functions that determine the best card to play under different considerations. Have the people building these LLM AI systems even had experience with classical AIs!? This is a solved problem; the LLM solution is slow, expensive, and energy-inefficient.

Worse, it’s difficult to tweak. For example, what if you want AIs that play at varying difficulties? Are you just gonna prompt the LLM “hey try to be kinda shitty at this but still somewhat good”?
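In a classical utility AI the difficulty knob really can be that simple: just add noise to the scoring function. A minimal sketch, with weights and card fields entirely made up for illustration:

```python
import random

# Hand-tuned weights over whatever features the Monte Carlo runs surfaced
# (values here are made up for illustration).
WEIGHTS = {"damage": 1.0, "card_advantage": 1.5, "tempo": 0.8}

def utility(card):
    return sum(w * card.get(feature, 0) for feature, w in WEIGHTS.items())

def pick_card(hand, difficulty=1.0, rng=random):
    # difficulty=1.0 plays greedily; lower values inject Gaussian noise,
    # so the AI still prefers good cards but blunders more often.
    sigma = (1.0 - difficulty) * 5.0
    return max(hand, key=lambda card: utility(card) + rng.gauss(0, sigma))

hand = [
    {"name": "Lightning Bolt", "damage": 3, "tempo": 1},
    {"name": "Divination", "card_advantage": 2},
]
print(pick_card(hand, difficulty=1.0)["name"])  # greedy pick: "Lightning Bolt"
```

At `difficulty=1.0` the noise vanishes and the pick is purely greedy; lowering it degrades play smoothly, with no prompt engineering involved.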


The anxiety is coming from the "worrier" personality. Players are a combination of a model version plus a small additional "personality" prompt - in this case (https://mage-bench.com/games/game_20260217_075450_g8/), "Worrier". That's why the player name is "Haiku Worrier". The personality is _supposed_ to only affect what the player says in chat (not its internal reasoning), but I haven't been able to make small models consistently understand that distinction so far.
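(A hypothetical sketch of that composition, not the actual mage-bench code: a shared base prompt plus a chat-scoped personality blurb.)

```python
# Hypothetical sketch of composing a player identity: base prompt plus a
# personality blurb that is explicitly scoped to table chat only.
BASE_PROMPT = "You are playing Magic: The Gathering. Choose legal actions carefully."

PERSONALITIES = {
    "Worrier": "Fret out loud about everything that could possibly go wrong.",
}

def build_system_prompt(personality):
    blurb = PERSONALITIES[personality]
    return (
        f"{BASE_PROMPT}\n\n"
        f"Chat personality ({personality}): {blurb}\n"
        "IMPORTANT: this personality applies ONLY to your chat messages, "
        "never to your internal strategic reasoning."
    )

print(build_system_prompt("Worrier"))
```

The failure mode described above is small models ignoring that final "ONLY to your chat messages" instruction and letting the blurb leak into their reasoning.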

The Gran-Gran thing looks more like a bug in my harness code than a fundamental shortcoming of the LLM. Abilities-on-the-stack are at the top of my "things where the harness seems pretty janky and I need to investigate" list. Opus would probably be able to figure it out, though.


Ha! I misread it as "Haiku Warrior" and so didn't make the connection. That makes a lot more sense!


Oh, fascinating - I didn't realize they released actual replay data publicly. It doesn't look like it's quite as rich as I'd like, though - it only captures one row per turn, so I don't think you can deduce things like targeting, the order in which spells are cast, etc.

(I also thought about pointing it at my personal game logs, but unfortunately there aren't that many, because I'm too busy writing analysis tools to actually play the game.)


Another thing that I've thought about doing is using some sort of computer vision to watch streamers of online games, and STT to capture not just play datasets but also datasets of their narrated reasoning about why they play what they play.

It would be a lot of work to go through the footage with computer vision and some measure of reasoning to create these datasets, but some players do an excellent job of narrating their reasoning for their viewers (thinking of players like Cheon or LSV), so it would be fascinating.

Caleb Gannon [0] is one such streamer who does a good job of narrating his plays, and he's also a computer scientist who is very interested in machine-learning projects (he's done several of his own). If you contacted him, I could definitely see him being willing to consent to his videos being used as a fine-tuning dataset for such purposes.

I would be willing to help with creating this dataset if you helped me understand what you would like to see in the final output format.

[0] - https://www.youtube.com/watch?v=YmAAK3V13b0
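For concreteness, one possible record shape for such a dataset (every field name and value here is illustrative, not a real schema): each decision point pairs the observable game state with the play made and the streamer's transcribed narration.

```python
import json

# One hypothetical decision-point record: observable game state, the play
# made, and the streamer's transcribed narration explaining it.
record = {
    "video_id": "YmAAK3V13b0",
    "timestamp_s": 412.5,
    "game_state": {"turn": 4, "life_totals": [17, 20], "hand": ["Lightning Bolt"]},
    "action": {"type": "cast", "card": "Lightning Bolt", "target": "opponent"},
    "narration": "I'll just Bolt their face here to stay ahead on tempo.",
}
print(json.dumps(record, indent=2))
```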


Down the road I can definitely imagine being interested in that (basically split out the "web-based replay viewer" part from the "LLM harness that I want to debug with a replay viewer" part, and then ingest non-LLM games into the viewer), but for now they're super entangled and I'm not prioritizing separating them cleanly. I'll definitely keep this offer in mind for the future, thanks!


I believe it's even possible to match up game IDs so that (hypothetically) if both players are using 17 Lands, then you can match up a game from both sides and get full information re: the hands of each player as well.

It obviously wouldn't be the full set of games (because not everyone uses 17Lands), but it would certainly be a nonzero dataset.
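The matching itself is just a join on the shared game ID. A sketch with made-up log rows (not the real 17Lands schema):

```python
# Join both players' logs on a shared game ID to recover full-information
# games (row fields are made up, not the actual 17Lands export format).
player_a_logs = [
    {"game_id": "g1", "hand": ["Island", "Counterspell"]},
    {"game_id": "g2", "hand": ["Mountain", "Shock"]},
]
player_b_logs = [
    {"game_id": "g1", "hand": ["Forest", "Llanowar Elves"]},
]

def match_games(logs_a, logs_b):
    by_id = {row["game_id"]: row for row in logs_b}
    # Only games logged from both sides yield full information.
    return [
        {"game_id": row["game_id"], "hands": (row["hand"], by_id[row["game_id"]]["hand"])}
        for row in logs_a
        if row["game_id"] in by_id
    ]

matched = match_games(player_a_logs, player_b_logs)
print(matched)  # only g1 survives: g2 has no opposing-side log
```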


Yeah, the intention here is not to answer "which deck is best" - the standard of play is nowhere near high enough for that. It's meant as more of a non-saturated benchmark for different LLM models, so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old". I'm optimistic that with continued improvements to the harness and new model releases we can get to at least "official Pro Tour stream commentator" skill levels within the next few years.


Hmm well, from my perspective, none of them are even really playing the game, they are just taking random actions. Any human, even a small child, would be much better.

And re: ages, it's worth noting that the youngest player to make Day 2 of a Grand Prix is 8 years old, and the youngest Pro Tour winner was 15 years old. I don't think it's realistic to get an LLM anywhere close to either of those players in skill level, though it's absolutely possible with a specialized model.


> , so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old".

no, no, no... please think. Human child psychology is not the same as an LLM engine rating. It is both inaccurate and destructive to actual understanding to say that common phrase. Asking politely - consider not saying that about LLM game ratings.


I was really hoping I could build this on top of MTGO or Arena, just as a bot interacting with real Wizards APIs and paying the developers money. But they've got very strong "absolutely no bots" terms of service, and my understanding is that outside of the special case of MTGO trading bots they're strongly enforced with bans. I assume their reasoning is that people do not want to get matched against bot players in tournaments, which is totally fair. (Also I'm not sure MTGO's infrastructure could handle the load of bot users...)


I ran a bot for years that I wrote in Java in a few minutes, and they never came after me. It just joined a match and played lands 24/7, and it won games every once in a while because people leave games randomly. It technically played all colors, and some of the trinkets count as spells, etc. This allowed me to never engage with any of their lootbox-like mechanics or other predatory practices.

Regarding actually doing it under the radar, there are a lot of ways. They likely catch most of the bot players because those bots create synthetic input events using the Windows API and similar; that's the same signal used by the CAPTCHA systems that try to stop web scraping, the kind that just ask for a button press.

This can be worked around with a fake mouse driver that is actually controlled by software, if you must stay on Windows. It can also be worked around by just running the client on Linux, or by running the client in QEMU and driving it through QEMU's native VNC, since those inputs arrive as hardware events too =)


Well, it's hard to do it under the radar if I'm posting it on HackerNews :) I've put enough money into MTGO (and, sigh, Arena) that I don't want to roll the dice on a ban.


That makes sense. I play Arena a bit, but have always rejected a monetization model that doesn't let players easily pick the cards they want, or play with proxies or something similar for casual games with friends. I have absolutely no interest in their competitive game modes. I was slightly interested, in the early days, in the idea of buying boosters that came with Arena codes, but they messed that up pretty badly, and paper Magic as a whole has been turned into a game of milking whales, similar to predatory mobile games and apps. The end result is that Arena is something I jump on to fool around with every few months, and then remember why I don't want a second part-time job.


Yeah, that's why I'm using XMage for my project - it has real rules enforcement.


I was really hoping they could play the game like a human does. Sadly they aren't that close :)

