Single-file Python agent (zero dependencies) that uses llama.cpp for local inference. Runs on a 2013 Mac Pro with a Xeon E5-1650 v2 and dual FirePro D500 GPUs. Qwen 3B does native tool calling at 15.6 tok/s. Also includes a 3-line patch to fix llama.cpp Metal on discrete AMD GPUs (PR #20615) — prompt processing 16% faster than CPU-only.
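For anyone curious what "zero dependencies" looks like in practice, here is a minimal stdlib-only sketch of talking to a local llama.cpp server. It assumes `llama-server` is running locally with its OpenAI-compatible endpoint; the host, port, and function names are placeholders, not the project's actual code:

```python
import json
import urllib.request

def build_chat_request(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    """Build a stdlib-only HTTP request for llama.cpp's OpenAI-compatible server."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def chat(prompt):
    """POST the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

The whole agent loop is then just `chat()` in a `while True:` with some tool-dispatch glue around it; no pip installs needed.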
Delivered. Please reconsider now. AI slop cannot build this without a human who has real RISC CPU knowledge.
The Emulator ---------------------------------------------- https://bottube.ai/watch/shFVLBT0kHY
OK, I promised videos; here are two. The LLM had serious issues going from C and Python on x86 to C on MIPS, but it now produces coherent English. Phase two is a chat interface so we can prompt without seeded prompts. Check the code, it's real inference though!
This feels like an AI agent doing its own thing. The screenshot of this working is garbled text (https://github.com/sophiaeagent-beep/n64llm-legend-of-Elya/b...), and I'm skeptical of reasonable generation from a small hard-coded training corpus. The linked devlog on YouTube is quite bizarre too.
819K parameters. Responses are short and sometimes odd. That's expected at this scale with a small training corpus. The achievement is that it runs at all on this hardware.
Context window is 64 tokens. Prompt + response must fit in 64 bytes.
No memory between dialogs. The KV cache resets each conversation.
Byte-level vocabulary. The model generates one ASCII character at a time.
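To make the constraints above concrete, here is a rough Python sketch of what a byte-level vocabulary with a shared 64-token window implies. This is illustrative only; the names are made up, not taken from train_sophia_v5.py:

```python
def encode(text):
    """Byte-level 'tokenization': every ASCII character is its own token."""
    return list(text.encode("ascii"))

def decode(tokens):
    """Inverse of encode: tokens are raw byte values."""
    return bytes(tokens).decode("ascii")

CONTEXT = 64  # prompt + response share this budget

def budget_for_response(prompt):
    """How many tokens the model may generate before the window is full."""
    used = len(encode(prompt))
    if used >= CONTEXT:
        raise ValueError("prompt alone overflows the 64-token window")
    return CONTEXT - used
```

A 20-character prompt leaves only 44 bytes of reply, which is part of why responses are so short.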
Future Directions
These are things we're working toward — not current functionality:
RSP microcode acceleration — the N64's RSP has 8-lane SIMD (VMULF/VMADH); offloading matmul would give an estimated 4–8× speedup over scalar VR4300
Larger model — with the Expansion Pak (8MB total), a 6-layer model fits in RAM
Richer training data — more diverse corpus = more coherent responses
Real cartridge deployment — EverDrive compatibility, real hardware video coming
Why This Is Real
The VR4300 was designed for game physics, not transformer inference. Getting Q8.7 fixed-point attention, FFN, and softmax running stably at 93MHz required:
Custom fixed-point softmax (bit-shift exponential to avoid overflow)
Q8.7 accumulator arithmetic with saturation guards
Soft-float compilation flag for float16 block scale decode
Alignment-safe weight pointer arithmetic for the ROM DFS filesystem
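For readers who want a feel for the arithmetic, here is a small Python model of a Q8.7 softmax with a bit-shift exponential. This is an illustration of the technique, not the code in nano_gpt.c; the constants and names are my own assumptions:

```python
ONE = 128  # 1.0 in Q8.7 (7 fractional bits)

def q87_exp(x):
    """Approximate exp(x) for x <= 0 in Q8.7.
    exp(x) = 2^(x * log2 e): the integer part of the exponent becomes a
    right shift, the fractional part a linear correction (ln 2 ~ 89/128)."""
    e = (-x * 185) >> 7          # -x * log2(e), with 185/128 ~ 1.4427, Q8.7
    ip, fp = e >> 7, e & 0x7F    # integer and fractional parts
    if ip >= 15:
        return 0                 # underflow to zero
    base = ONE >> ip             # 2^(-ip) by bit shift
    return base - ((base * ((fp * 89) >> 7)) >> 7)

def q87_softmax(logits):
    """Softmax over Q8.7 logits. Subtracting the max keeps every exponent
    argument <= 0, so no intermediate value can overflow the accumulator."""
    m = max(logits)
    exps = [q87_exp(v - m) for v in logits]
    s = sum(exps) or 1
    return [(e * ONE) // s for e in exps]
```

For example, `q87_softmax([256, 128, 0])` (logits 2.0, 1.0, 0.0) gives `[88, 30, 8]`: a distribution that sums to slightly under ONE because of truncation, which is an acceptable trade at this scale.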
The inference code is in nano_gpt.c. The training script is train_sophia_v5.py. Build it yourself and verify.
Partially correct. The value is not the game interface right now. It's proof that you can do actual inference with an LLM. The surprise I am developing is a bit bigger than this; I just have to get the LLM outputs right first!
You’re right that the graphics layer is mostly 2D right now. Sprites are hardware-accelerated where it makes sense, and text is written directly to the framebuffer. The UI is intentionally minimal.
The point of this ROM wasn’t the game interface — it was proving real LLM inference running on-device on the N64’s R4300i (93 MHz MIPS, 4MB RDRAM).
Since the original screenshots, we’ve added:
• Direct keyboard input
• Real-time chat loop with the model
• Frame-synchronous generation (1–3 tokens per frame @ 60 FPS)
So it’s now interactive, not just a demo render.
The current focus is correctness and stability of inference. The graphics layer can evolve later.
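The frame-synchronous pacing above can be modeled in a few lines: each frame the decoder is allowed a small token budget, so generation never stalls rendering. This is a generic Python sketch of the scheduling idea; the real loop is C driven by the N64's vertical interrupt:

```python
TOKENS_PER_FRAME = 3  # upper bound from the 1-3 tokens/frame figure

def run_frames(generate_token, total_tokens, tokens_per_frame=TOKENS_PER_FRAME):
    """Drive generation one frame at a time. Returns a list of per-frame
    token batches; len(result) is how many frames the reply took."""
    frames, produced = [], 0
    while produced < total_tokens:
        batch = []
        for _ in range(min(tokens_per_frame, total_tokens - produced)):
            batch.append(generate_token())  # one decode step
            produced += 1
        frames.append(batch)  # the real loop would wait for vblank here
    return frames
```

At 60 FPS and 3 tokens per frame, a 45-byte reply takes 15 frames, i.e. a quarter of a second of wall-clock time.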
Next step is exposing a lightweight SDK layer so N64 devs can hook model calls into 3D scenes or gameplay logic — essentially treating the LLM as a callable subsystem rather than a UI gimmick.
The value isn’t the menu.
It’s that inference is happening on 1996 silicon.
Happy to answer specifics about the pipeline if you’re interested.
We are uploading weights.bin. It's really meant for you to generate your own LLM, but we're uploading it anyway. People are ripping on it, but they didn't check the code themselves. This is a tech demo: it's not about the graphics, it's about the LLM inferring on the hardware, lol.
This is the text-inference issue I was alluding to. We had several hurdles to overcome: (1) LLMs are trained on little-endian machines, while MIPS on the N64 is big-endian; (2) we had Python-to-C porting issues; (3) we had quantization issues. All are being resolved.
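The endianness hurdle is easy to demonstrate: float32 weights exported on an x86 box are little-endian, and reading those bytes naively on big-endian MIPS scrambles every value. A sketch using only Python's `struct` module (illustrative, not the project's actual exporter):

```python
import struct

def export_weights_big_endian(values):
    """Pack float32 weights big-endian so the N64 can read them directly."""
    return struct.pack(f">{len(values)}f", *values)

def read_as(buf, byte_order):
    """Interpret a buffer as float32s with the given byte order ('<' or '>')."""
    return list(struct.unpack(f"{byte_order}{len(buf) // 4}f", buf))

# Pack little-endian (the x86 default), then interpret both ways.
le = struct.pack("<2f", 1.0, -0.5)
assert read_as(le, "<") == [1.0, -0.5]   # correct on x86
assert read_as(le, ">") != [1.0, -0.5]   # garbage when read big-endian
```

Swapping at export time (or in the ROM build step) means the VR4300 side never has to byte-swap at inference time.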
This is a tech demo to honor LoZ, and the code can also be used by N64 devs to add AI-style NPCs in the future. So did we achieve it? Yes: we are the first to do LLM inference on an N64. I am just trying to get you guys a proper video.