montebicyclelo's comments

Reminds me of this [1] HN post from 9 months ago, where the author trained a neural network to do world emulation from video recordings of their local park — you can walk around in their interactive demo [2].

I don't have access to the DeepMind demo, but from the video it looks like it takes the idea up a notch.

(I don't know the exact lineage of these ideas, but a general observation is that it's a shame that it's the norm for blog posts / indie demos to not get cited.)

[1] https://hackernews.hn/item?id=43798757

[2] https://madebyoll.in/posts/world_emulation_via_dnn/demo/


Yup, similar concepts! Just at two opposite extremes of the compute/scaling spectrum.

- That forest trail world is ~5 million parameters, trained on 15 minutes of video, scoped to run on a five-year-old iPhone through a twenty-year-old API (WebGL GPGPU, i.e. OpenGL fragment shaders). It's the smallest '3D' world model I'm aware of.

- Genie 3 is (most likely) ~100 billion parameters trained on millions of hours of video and running across multiple TPUs. I would be shocked if it's not the largest-scale world model available to the public.

There are lots of neat intermediate-scale world models being developed as well (e.g. LingBot-World https://github.com/robbyant/lingbot-world, Waypoint 1 https://huggingface.co/blog/waypoint-1) so I expect we'll be able to play something of Genie quality locally on gaming GPUs within a year or two.


I was immediately struck, when I looked down at just the boardwalk, by how similar it felt to being on LSD. I am continually astounded by how similar these systems end up seeming to how our brain works. It may just be a happy coincidence, but I am pretty sold on there being true parallels that will only become more and more apparent.

A lot of people mentioned this! The "dreamlike" comparison is common as well. In both cases, you have a network of neurons rendering an image approximating the real world :) so it sort of makes sense.

Regarding the specific boiling-textures effect: there's a tradeoff in recurrent world models between jittering (constantly regenerating fine details to avoid accumulating error) and drifting (propagating fine details as-is, even when that leads to accumulating error and a simplified/oversaturated/implausible result). The forest trail world is tuned way towards jittering (you can pause with `p` and step frame-by-frame with `.` to see this). So if the effect resembles LSD, it's possible that LSD applies some similar random jitter/perturbation to the neurons within your visual cortex.
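
To make the tradeoff concrete, here's a toy sketch (purely illustrative, not the actual model): a single `jitter` knob controls how much the carried frame is perturbed each step, so fine detail is either regenerated (jitter) or propagated verbatim (drift).

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(frame, action):
        # stand-in for the learned next-frame network
        return np.clip(frame + 0.01 * action, 0.0, 1.0)

    def rollout(frame, actions, jitter=0.05):
        # jitter > 0: perturb the carried frame so fine details get
        #   re-rendered each step (less drift, more "boiling" textures)
        # jitter = 0: propagate details as-is (less boiling, but errors
        #   accumulate over the rollout)
        frames = []
        for action in actions:
            noisy = frame + jitter * rng.normal(size=frame.shape)
            frame = predict(noisy, action)
            frames.append(frame)
        return frames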


Originally wrote this in 2022 to improve my understanding of deep learning internals. Recently refreshed it, removing CuPy to keep it lighter and educational. (Plus modernized the stack, e.g. with uv and improved CI.)


For an example of an AI generated song that's gone viral in the last few days, getting millions of views on Spotify / Youtube, see this post from earlier today:

"Tell HN: Viral Hit Made by AI, 10M listens on Spotify last few days" [1]

[1] https://hackernews.hn/item?id=46600681


This is quite sad, in that the fact that it's AI generated is hidden in the disclaimer and nearly all of the comments think it's a real human.


Yep, it is convincing. And it seems the video is performed by a human, although I did think it could be AI. They say in the description:

> I couldn't resist putting a face to this gem.

After saying it is AI:

> How people was touched by this version of « Papoutai » by stromae made by AI, I was also touched as y’all, as an independent Artist I wanted to put my emotions and my soul on this masterpiece

https://www.youtube.com/watch?v=bQ8GbwQV5zE


> as an independent Artist I wanted to put my emotions and my soul on this masterpiece

Barf


As a hobbyist, shaders are up there among the most fun types of programming. Low-level, relatively simple language, often tied to a satisfying visual result. Once it clicks, it's a cool paradigm to be working in, e.g. "I am coding from the perspective of a single pixel".


I found them fun once they worked, but when something didn't work, I didn't enjoy debugging them so much.


Nothing like outputting specific colors to see which branch the current pixel is running through. It's like printf debugging, but colorful and with only three floats of output.


Well, in the GL/Vulkan world there's finally functionality, added in recent years, for printf output from shaders, which fixes the issue. I'd assume DirectX probably has something similar, but I don't work with it so I don't know.


Hm, I only have limited experience with WebGPU so far, but since it's still highly unstable, and I really would like printf functionality and all the performance possible, I guess I should invest my learning efforts in Vulkan instead. Thanks for the hint.


I agree it’s very difficult to debug them. I sometimes rewrite my shaders in VEX and debug them in that. It’s a shader language that runs on the CPU in Houdini. You can output a value at each pixel, which is useful for values outside the range of 0 to 1, or you can use printf(). I’m still looking for something that will transpile shaders into JavaScript.


It’s interesting that this hasn’t been solved for pixel shaders. With HIP in the GPGPU world I’m able to set breakpoints in a GPU kernel and step through line by line. I can also add printf statements to output values to the console.


You can do all that with Vulkan and RenderDoc.


    bulk_send(
        generate_expiry_email(user)
        for user in db.get_users()
        if is_expired(user, date.today())
    )
(...Just another flavour of syntax to look at)


The nice thing with the Elixir example is that you can easily `tap()` to inspect how the data looks at any point in the pipeline. You can also easily insert steps into the pipeline, or reuse pipeline steps. And due to the way modules are usually organized, it would more realistically read like this, if we were in a BulkEmails module:

  Users.all()
  |> Enum.filter(&Users.is_expired?(&1, Date.utc_today()))
  |> Enum.map(&generate_expiry_email/1)
  |> tap(&IO.inspect(&1, label: "Expiry Email"))
  |> Enum.reject(&is_nil/1)
  |> bulk_send()
The nice thing here is that we can easily log to the console, and also filter out nil expiry emails. In production code, `generate_expiry_email/1` would likely return a Result (a tuple of `{:ok, email}` or `{:error, reason}`), so we could complicate this a bit further and collect the errors to send to a logger, or to update some flag in the db.

It just becomes so easy to incrementally add functionality here.

---

Quick syntax reference for anyone reading:

- Pipelines apply the previous result as the first argument of the next function

- The `/1` after a function name indicates the arity, since Elixir identifies functions by name and arity (the same name can exist with different arities)

- `&fun/1` expands to `fn arg -> fun(arg) end`

- `&fun(&1, "something")` expands to `fn arg -> fun(arg, "something") end`


Not sure I like how the binding works for `user` in this example, but tbh I don't really have a better idea.

Writing custom monad syntax is definitely quite a nice benefit of functional languages IMO.


Incorrect Pytorch gradients with Apple MPS backend...

Yep this kind of thing can happen. I found and reported incorrect gradients for Apple's Metal-backed tensorflow conv2d in 2021 [1].

(Pretty sure I've seen incorrect gradients with another Pytorch backend, but that was a few years ago and I don't seem to have raised an issue to refer to... )

One might think this class of errors would be caught by a test suite. Autodiff can be tested quite comprehensively against numerical differentiation [2]. (Although this example is from a much simpler lib than Pytorch, so I could be missing something.)
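
For a concrete, standalone sketch of that kind of check (hypothetical example; not the SmallPebble test in [2]), comparing an autodiff gradient against central differences in PyTorch:

    import torch

    def f(x):
        return (x ** 2).sin().sum()

    x = torch.randn(5, dtype=torch.float64, requires_grad=True)
    f(x).backward()
    analytic = x.grad.clone()

    # central differences: df/dx_i ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    eps = 1e-6
    numeric = torch.zeros_like(x)
    with torch.no_grad():
        for i in range(x.numel()):
            e = torch.zeros_like(x)
            e[i] = eps
            numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)

    assert torch.allclose(analytic, numeric, atol=1e-6)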

[1] https://github.com/apple/tensorflow_macos/issues/230

[2] https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...


I’ve also found that some versions of torch get quite different inference results on MPS, ignoring gradient. See https://gist.github.com/gcr/4d8833bb63a85fc8ef1fd77de6622770


Yeah, luckily, you can unit test these and fix them. They are not concurrency bugs (again, luckily).

BTW, testing against numerical differentiation only goes so far (due to the computational cost when you're dealing with big matrices). It is much easier / more effective to test against multiple implementations.


You can easily test a gradient using only the forward pass, by checking that f(x + h) ≈ f(x) + dot(g, h) for a small random h.
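
A minimal sketch of that check (assuming PyTorch, with a made-up `f`):

    import torch

    def f(x):
        return (x ** 3).sum()

    x = torch.randn(5, dtype=torch.float64, requires_grad=True)
    y = f(x)
    y.backward()
    g = x.grad

    # f(x + h) should be close to f(x) + dot(g, h) for a small random h
    h = 1e-6 * torch.randn(5, dtype=torch.float64)
    with torch.no_grad():
        lhs = f(x + h)
        rhs = y + torch.dot(g, h)
    print(torch.allclose(lhs, rhs, atol=1e-9))  # True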


Agreed, and its larger context window is fantastic. My workflow:

- Convert the whole codebase into a string

- Paste it into Gemini

- Ask a question

People seem to be very taken with "agentic" approaches where the model selects a few files to look at, but I've found it very effective and convenient to just give the model the whole codebase, and then have a conversation with it, get it to output code, modify a file, etc.
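
For reference, the "codebase into a string" step can be as simple as something like this (a rough sketch; the extension and skip lists are assumptions to adapt per project):

    from pathlib import Path

    EXTS = {".py", ".md", ".toml", ".js", ".ts"}
    SKIP = {".git", "node_modules", ".venv", "dist"}

    chunks = []
    for path in sorted(Path(".").rglob("*")):
        if (path.is_file() and path.suffix in EXTS
                and not any(part in SKIP for part in path.parts)):
            chunks.append(f"===== {path} =====\n{path.read_text(errors='ignore')}")

    blob = "\n\n".join(chunks)  # paste `blob` into the model's context window
    print(f"{len(blob):,} characters")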


I usually do that in a two-step process. Instead of giving the full source code to the model, I will ask it to write a comprehensive, detailed description of the architecture, intent, and details (including filenames) of the codebase to a Markdown file.

Then for each subsequent conversation I would ask the model to use this file as reference.

The overall idea is the same, but going through an intermediate file allows for manual amendments in case the model consistently forgets some things. It also gives the model an easier time finding information and reasoning about the codebase in a pre-summarized format.

It's sort of like giving the model rich metadata and an index of the codebase, instead of dumping the raw data on it.


My special hack on top of what you suggested: ask it to draw the whole codebase in graphviz-compatible graph markup. There are various tools out there to render this as an SVG or whatever, to get an actual map of the system. Very helpful when diving into a big new area.


You can use mermaid format instead of graphviz, then paste it into a markdown file and github will render it inline.


For anyone wondering how to quickly get your codebase into a good "Gemini" format, check out repomix. Very cool tool and unbelievably easy to get started with. Just type `npx repomix` and it'll go.

Also, use Google AI Studio, not the regular Gemini plan for the best results. You'll have more control over results.


> Convert the whole codebase into a string

When using the Gemini web app on a desktop system (this could differ depending on how you consume Gemini): if you select the + button in the bottom-left of the chat prompt area, select Import code, and then choose the "Upload folder" link at the bottom of the dialog that pops up, it'll pull up a file dialog letting you choose a directory. It will upload all the files in that directory and all subdirectories (recursively), and you can then prompt it on that code from there.

The upload process for average-sized projects is, in my experience, close to instantaneous (obviously your mileage can vary if you have any sort of large asset/resource-type files commingled with the code).

If your workflow already works then keep with it, but for projects with a pretty clean directory structure, uploading the code via the Import system is very straightforward and fast.

(Obvious disclaimer: depending upon your employer, the code base in question, etc., uploading a full directory of code like this to Google or anyone else may not be kosher; be sure any copyright holders of the code are OK with you giving a "cloud" LLM access to it.)


Well, I am not sure Gemini or any other LLM tools respect `.gitignore`, which can immediately make the context window jump over the maximum.

Tools like repomix[0] do this better, plus you can add your own extra exclusions on top. It also estimates token usage as part of its output, but I found it too optimistic, i.e. it regularly says "40_000 tokens" but when uploading the resulting single XML file to Gemini it's actually, e.g., 55k-65k tokens.

[0] https://github.com/yamadashy/repomix/


I agree. I use repomix with AI Studio extensively and never found anything (including the cli agents) that's close.

I sometimes upload codebases that are around 600k tokens and even those work.

Repomix also lets you create a config file so you can give it ignore/include patterns in addition to .gitignore.

It also tells you about the outlier files with exceptionally long content.


Try Codex and Claude Code: game-changing ability to use CLI tools, edit/reorganize multiple files, even interact with git.


Gemini cli is a thing that exists. Are you saying those specifically are better? Or CLIs are better?


OpenAI Codex currently seems quite a lot better than Gemini 2.5 and marginally better than Claude.

I'm using all three back-to-back via the VS Code plugins (which I believe are equivalent to the CLI tools).

I can live with either OpenAI Codex or Claude. Gemini 2.5 is useful but it is consistently not quite as good as the other two.

I agree that for non-Agentic coding tasks Gemini 2.5 is really good though.


Since I have only used Gemini Pro 2.5 (free) and Claude on the web (free), and I am thinking of subscribing to one or two services, are you saying that:

- Gemini Pro 2.5 is better when you feed it more code and ask it to do a task (or more than one)?

- ...but that GPT Codex and Claude Code are better at iterating on a project?

- ...or something else?

I am looking to gauge my options. Will be grateful for your shared experience.


Codex and Claude are better than Gemini in all coding tasks I've tried.

At the "smart autocomplete" level the distinction isn't large but it gets bigger the more agentic you ask for.


Gemini CLI does all this too


I started using Gemini like that as well, but with Gemini CLI. Point it at the directory and then converse with it about the codebase. It's wonderful.


I don't know though, I've seen many issues occur because of longer context. It makes sense: given there are only so many attention heads, the longer the context, the less chance attention will pick out the relevant tokens.


The CLI tools really are way faster. You can use them the same way if you want; you just don't have to copy-paste stuff around all the time.


On the contrary; now might be a good time to get an M1 Max laptop. A second-hand one, ex-corporate, in good condition, with 64GB RAM, is pretty good value compared to new laptops at the same price. It's still a fantastic CPU.


That's what I did; bought a used one with 64GB and a dent in the back for ~$1k, a year back or so. Some of the best money I've ever spent.


Where would one look for ex-corporate MacBook Pros?


At your own risk — one place is eBay sellers with a large number of positive reviews (and not many negative), who are selling lots of the same type of MacBook Pros. My assumption is that they've got a bunch of corporate laptops to sell off.


Honestly the only Apple Silicon e-waste has been their 8GB models. And even those are still perfectly good for most people so long as they use Safari rather than Chrome.


Does Safari use less RAM?


The data may be somewhat dated and I haven't measured it myself, but:

“Per his findings, Chrome used 290MB of RAM per open tab, while Safari only used 12MB of RAM per open tab.”

https://www.macrumors.com/2021/02/20/chrome-safari-ram-test/


> nanochat is also inspired by modded-nanoGPT

Nice synergy here; the lineage is: Karpathy's nanoGPT -> Keller Jordan's modded-nanoGPT (a speedrun of training nanoGPT) -> nanochat

modded-nanoGPT [1] is a great project, well worth checking out; it's all about massively speeding up the training of a small GPT model.

Notably, it uses the author's Muon optimizer [2] rather than AdamW (for the linear layers).

[1] https://github.com/KellerJordan/modded-nanogpt

[2] https://kellerjordan.github.io/posts/muon/


Muon was invented by Keller Jordan (and then optimized by others) for the sake of this speedrunning competition. Even though it was invented less than a year ago, it has already been widely adopted as SOTA for model training


This is the common belief but not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added practical things to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc).

Both share equal credit I feel (also, the paper's co-authors!), both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.

(Source: am experienced speedrunner who's been in these circles for a decent amount of time)


I think it's good to bring up Bernstein & Newhouse, as well as Yuchen Jin, Jiacheng You, and the other speedrunners who helped iterate on Muon. But I think it's very fair to call Keller Jordan the main author of Muon in its current form. I'm also in the speedrunning community, though maybe not for as long as you have been.


Sharing some useful resources for learning Muon (since I'm also just catching up on it):

- https://x.com/leloykun/status/1846842883967692926

- https://www.yacinemahdid.com/p/muon-optimizer-explained-to-a...


This Simple Optimizer Is Revolutionizing How We Train AI [Muon]

https://www.youtube.com/watch?v=bO5nvE289ec

I found the above video to be a good introduction.


The most exciting thing about Muon for me is that it requires half the state of Adam while having either equivalent or better performance. That's amazing if you are VRAM limited! And just like Adam, you can also quantize it. I can get it to work relatively well as low as 4-bit, which essentially cuts down the memory requirements from full 32-bit Adam by a factor of 16x! (And by a factor of 4x vs 8-bit Adam).
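
Back-of-envelope check of those numbers, counting optimizer state only (bits per parameter):

    adam_fp32 = 2 * 32  # Adam/AdamW: two fp32 moments  -> 64 bits per param
    adam_int8 = 2 * 8   # 8-bit Adam                    -> 16 bits per param
    muon_fp32 = 1 * 32  # Muon: one momentum buffer     -> 32 bits per param
    muon_4bit = 1 * 4   # 4-bit quantized Muon          ->  4 bits per param

    print(adam_fp32 / muon_fp32)  # 2.0  -> half the state of Adam
    print(adam_fp32 / muon_4bit)  # 16.0 -> 16x less than full 32-bit Adam
    print(adam_int8 / muon_4bit)  # 4.0  -> 4x less than 8-bit Adam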


I haven't heard of this before. Has Muon dethroned Adam and AdamW as the standard general purpose optimizer for deep learning?


It's for hidden layers, not for every parameter. From Keller's Muon GitHub page:

"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."

And I just looked into the nanochat repo; that's also how it's used here:

https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
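
As a rough sketch of what that split looks like in practice (illustrative only; the `Muon` import and its signature are assumptions based on the KellerJordan/Muon repo, not the exact nanochat code):

    import torch
    import torch.nn as nn
    from muon import Muon  # assumed import; see github.com/KellerJordan/Muon

    class TinyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(1000, 64)
            self.hidden = nn.Linear(64, 64)
            self.head = nn.Linear(64, 1000)

    model = TinyModel()

    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        # Hidden weight matrices -> Muon; embeddings, heads, biases -> AdamW.
        if p.ndim == 2 and "embed" not in name and "head" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)

    optimizers = [
        Muon(muon_params, lr=0.02, momentum=0.95),  # assumed signature
        torch.optim.AdamW(adamw_params, lr=3e-4),
    ]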


8xH100 is pretty wild for a single inference node.

Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?

At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700-ish requests per hour. About $0.01 per request.
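
Working it through (assuming requests are served one at a time; real inference batches many in parallel):

    cost_per_hour = 8.00                            # $/hr for the node
    seconds_per_request = 5
    requests_per_hour = 3600 / seconds_per_request  # 720
    print(cost_per_hour / requests_per_hour)        # ~$0.011 per request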

Is my math wrong?


This is the spec for a training node. The inference requires 80GB of VRAM, so significantly less compute.


The default model is ~0.5B params right?


As vessenes wrote, that's for training. But an H100 can also process many requests in parallel.

