Hacker News | new | past | comments | ask | show | jobs | submit | mingli_yuan's comments

Hi HN,

I’ve been experimenting with a different kind of LLM benchmark, and wanted to share it here for feedback.

IntentGrid is a language-only, turn-based competitive game designed to test strategic planning, spatial reasoning, and long-horizon decision making in large language models.

Instead of puzzles or static tasks, models play a 40-turn adversarial game on a 13×13 grid. Each turn, they must:

- analyze a dense board state,
- reason about future congestion and forced combat,
- express intent in natural language,
- and output a strictly validated action plan.

Because 80 units are spawned over 40 turns on a 169-cell board, the system guarantees saturation: combat is unavoidable, and passive survival fails. Timing, positioning, and coordination matter more than tactics alone.
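To make "strictly validated action plan" concrete, here is a simplified sketch of what one turn's output and its bounds check could look like. The field names and the `valid` helper are illustrative assumptions, not the exact IntentGrid schema:

```python
# Illustrative only: field names are assumptions, not the real IntentGrid schema.
plan = {
    "intent": "Hold the center, delay combat on the east flank",
    "actions": [
        {"unit": "u12", "move": [6, 7]},   # step toward the central cell
        {"unit": "u15", "move": [11, 4]},  # back away from forced combat
    ],
}

def valid(action, size=13):
    """Reject moves that fall outside the 13x13 board."""
    x, y = action["move"]
    return 0 <= x < size and 0 <= y < size

assert all(valid(a) for a in plan["actions"])
```

Plans that fail validation (malformed JSON, out-of-bounds moves, commands for dead units) are rejected, so models cannot hide behind vague prose.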

A concrete match example (Kimi vs Gemini): https://intentgrid.org/match/25f2530d-c7e6-4553-b231-dff4a98...


Hi HN,

We've seen LLMs struggle with complex, multi-step reasoning tasks. The common approach, Chain-of-Thought (CoT), often requires massive datasets, is brittle, and suffers from high latency.

To tackle this, we developed the Hierarchical Reasoning Model (HRM), a novel recurrent architecture inspired by how the human brain processes information across different timescales (as seen in the diagram on the left).

It's a small model that packs a huge punch. Key highlights:

- Extremely lightweight: only 27 million parameters.
- Data efficient: trained with just 1,000 samples for the complex tasks shown.
- No pre-training needed: it works from scratch, without massive pre-training or any CoT supervision data.
- Single forward pass: it solves the entire reasoning task in one go, making it fast and efficient.

How It Works

HRM consists of two interconnected recurrent modules that mimic brain-wave coupling:

- High-level module: operates slowly, like the brain's theta waves (θ, 4–8 Hz), to handle abstract planning and goal setting.
- Low-level module: operates quickly, like gamma waves (γ, ~40 Hz), to execute the fine-grained computational steps.

These two modules work together, allowing the model to achieve significant computational depth while remaining stable and efficient to train.
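The nesting of the two timescales can be sketched as follows. This is a minimal toy illustration of two-timescale recurrence, not the actual HRM code: the tanh cells, dimensions, and step counts are all assumptions made just to show the loop structure:

```python
# Toy sketch of two-timescale recurrence (NOT the HRM implementation).
# All weights, sizes, and step counts here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # shared hidden size (assumption)
T_high, T_low = 4, 8        # 4 slow steps, each containing 8 fast steps

W_h = rng.normal(scale=0.1, size=(d, d))   # high-level recurrence
W_l = rng.normal(scale=0.1, size=(d, d))   # low-level recurrence
W_hl = rng.normal(scale=0.1, size=(d, d))  # high -> low conditioning
W_lh = rng.normal(scale=0.1, size=(d, d))  # low -> high summary

z_high = np.zeros(d)        # slow "theta-like" state
z_low = np.zeros(d)         # fast "gamma-like" state
x = rng.normal(size=d)      # encoded input (assumption)

for _ in range(T_high):                    # slow outer loop
    for _ in range(T_low):                 # fast inner loop
        z_low = np.tanh(z_low @ W_l + z_high @ W_hl + x)
    # The high-level state updates only after the low-level module
    # has run its burst of fine-grained steps.
    z_high = np.tanh(z_high @ W_h + z_low @ W_lh)

# z_high would then feed an output head; total depth is T_high * T_low
# inner steps, from a single forward pass of a small recurrent model.
```

The point of the sketch is the loop nesting: one slow update per burst of fast updates gives the model effective depth without an extremely deep feed-forward stack.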

Astonishing Performance

The results speak for themselves (see charts on the right). On tasks requiring complex, precise reasoning, HRM dramatically outperforms much larger models:

- Extreme Sudoku (9×9): HRM achieves 55.0% accuracy. Other approaches, including direct prediction and larger LLMs like Claude 3.7 (8K context), score 0.0%.
- Hard Maze (30×30): HRM finds the optimal path 74.5% of the time. Again, others score 0.0%.
- ARC-AGI benchmark: on the Abstraction and Reasoning Corpus (ARC), a key test for AGI capabilities, HRM significantly outperforms larger models with much longer context windows.

We believe HRM represents a transformative step towards more general and efficient reasoning systems. It shows that a carefully designed architecture can sometimes beat brute-force scale.

We'd love to hear your thoughts on this approach! What other applications could you see for a model like this?

Paper: https://arxiv.org/abs/2506.21734
Code: https://github.com/sapientinc/HRM


A geometry theory at a very early stage. Just imagine the millions upon millions of addition-multiplication operations in GPUs connected into a smooth space, on which we can then do gradient tricks.


Or you can download the slides here: https://onecorner.org/curiosity/aegeom/invitation.pdf


Large language models are the new machines, and prompts are the new programs. The moral rules written into prompts will be enforced in real life. Prompts are an entirely new medium with moral properties, unlike scripts or novels, which can only be experienced through viewing or reading. Prompts are a powerful medium and a programming language at once; humans need to rethink them.


You may check the ColorfulClouds Weather API: https://open.caiyunapp.com/ColorfulClouds_Weather_API

