Lm.rs: Minimal CPU LLM inference in Rust with no dependency (github.com/samuel-vitorino)
297 points by littlestymaar 1 day ago | 74 comments





This is impressive. I just ran the 1.2G llama3.2-1b-it-q80.lmrs on an M2 64GB MacBook and it felt speedy and used 1000% of CPU across 13 threads (according to Activity Monitor).

    cd /tmp
    git clone https://github.com/samuel-vitorino/lm.rs
    cd lm.rs
    RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
    curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
    curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
    ./target/release/chat --model llama3.2-1b-it-q80.lmrs

Not sure how to phrase this, but what does this mean in terms of how "smart" it is compared to the latest ChatGPT version?

The model I'm running here is Llama 3.2 1B, the smallest on-device model I've tried that has given me good results.

The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to be laughably poor in comparison to something like GPT-4o - which I'm guessing is measured in the 100s of GBs.

You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/


anyone else think 4o is kinda garbage compared to the older gpt4? as well as o1-preview and probably o1-mini.

gpt4 tends to be more accurate than 4o for me.


I sort of do, especially against OG GPT-4 (before turbo)

4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!

It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...

o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)


Yes, I use 4o for customer support in multiple languages and sometimes I have to tell it to reply using the customer's language, while gpt4 could easily infer it.

gpt-4o is a weak version of gpt-4 with "steps-instructions". GPT-4 is just too expensive, which is why OpenAI is releasing all these mini versions.

> that has given me good results.

Can you help somebody out of the loop frame/judge/measure 'good results'?

Can you give an example of something it can do that's impressive/worthwhile? Can you give an example of where it falls short / gets tripped up?

Is it just a hallucination machine? What good does that do for anybody? Genuinely trying to understand.


It can answer basic questions ("what is the capital of France"), write terrible poetry ("write a poem about a pelican and a walrus who are friends"), perform basic summarization and even generate code that might work 50% of the time.

For a 1.2GB file that runs on my laptop those are all impressive to me.

Could it be used for actual useful work? I can't answer that yet because I haven't tried. The problem there is that I use GPT-4o and Claude 3.5 Sonnet dozens of times a day already, and downgrading to a lesser model is hard to justify for anything other than curiosity.


The implementation has no control over "how smart" the model is, and when it comes to Llama 1B, it's not very smart by current standards (but it would still have blown everyone's mind just a few years back).

The implementation absolutely can influence the outputs.

If you have a sloppy implementation that somehow accumulates a lot of error in its floating-point math, you will get worse results.

It's rarely talked about, but it's a real thing. Floating-point addition and multiplication are non-associative, so the order of operations affects both correctness and performance. Developers might (unknowingly) trade one for the other. And it matters a lot more in the low-precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
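
To make this concrete, here's a rough sketch (assuming the `half` crate for an fp16 type, nothing from lm.rs itself) comparing a naive fp16 accumulator with a pairwise reduction:

    use half::f16;

    // Naive loop: round to fp16 after every addition, i.e. a true fp16 accumulator.
    fn naive_sum(xs: &[f16]) -> f16 {
        xs.iter()
            .fold(f16::ZERO, |acc, &x| f16::from_f32(acc.to_f32() + x.to_f32()))
    }

    // Pairwise (tree) reduction: rounding error grows roughly with log n instead of n.
    fn pairwise_sum(xs: &[f16]) -> f16 {
        match xs.len() {
            0 => f16::ZERO,
            1 => xs[0],
            n => {
                let (a, b) = xs.split_at(n / 2);
                f16::from_f32(pairwise_sum(a).to_f32() + pairwise_sum(b).to_f32())
            }
        }
    }

    fn main() {
        let ones = vec![f16::ONE; 9_999];
        // The naive sum stalls at 2048: past that point, adding 1.0 rounds away entirely.
        println!("naive:    {}", naive_sum(&ones));
        // The pairwise sum lands much closer to 9,999 (which fp16 cannot represent exactly).
        println!("pairwise: {}", pairwise_sum(&ones));
    }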


How well does bf16 work in comparison?

Even worse, I'd say, since it has fewer bits for the fraction. At least in the example I was mentioning, you run into precision limits, not range limits.

I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
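
For illustration, the widening really is just a bit shift (a quick sketch, not any particular library's conversion routine):

    // bf16 is the top 16 bits of an f32, so widening it is a left shift.
    fn bf16_to_f32(bits: u16) -> f32 {
        f32::from_bits((bits as u32) << 16)
    }

    // Truncating narrowing; real implementations usually round to nearest even instead.
    fn f32_to_bf16(x: f32) -> u16 {
        (x.to_bits() >> 16) as u16
    }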


I thought all current implementations accumulate into a fp32 instead of accumulating in fp16.

We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case of this causing differing output, but after 200 tokens, hence unlikely to be detected in many benchmarks.
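
For reference, the idea in a minimal sketch (Rust for consistency with this thread, not gemma.cpp's actual C++): keep the logits in f32 but accumulate the normalization sum in f64.

    fn softmax_f64_accum(logits: &[f32]) -> Vec<f32> {
        // Subtract the max for numerical stability.
        let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f64> = logits.iter().map(|&x| ((x - max) as f64).exp()).collect();
        let sum: f64 = exps.iter().sum(); // the higher-precision accumulation
        exps.iter().map(|&e| (e / sum) as f32).collect()
    }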

Does anyone have experience with higher-precision matmul and whether it is worthwhile?


Isn’t 200 tokens basically nothing? Did you mean to say 2000?

I haven't looked at all implementations, but the hardware (tensor cores as well as cuda cores) allows you to accumulate at fp16 precision.

TIL, thanks.

Could you try with

    ./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
to know how many tokens/s you get?

Nice, just tried that with "tell me a long tall tale" as the prompt and got:

    Speed: 26.41 tok/s
Full output: https://gist.github.com/simonw/6f25fca5c664b84fdd4b72b091854...

How much with llama.cpp? A 1B model should be a lot faster on an M2.

Given that this relies at its core on the `rayon` and `wide` libraries, which are decently baseline-optimized but quite a bit away from what llama.cpp can do when specialized for such a specific use case, I think the speed is about what I would expect.
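
For anyone curious, the basic pattern looks roughly like this (a sketch, not lm.rs's actual code): rayon parallelizes a matrix-vector product over output rows, and `wide` would supply SIMD inside each dot product (omitted here for brevity).

    use rayon::prelude::*;

    // Row-major weights (rows x cols) times a vector: one rayon task per output row.
    fn matvec(weights: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
        (0..rows)
            .into_par_iter()
            .map(|r| {
                let row = &weights[r * cols..(r + 1) * cols];
                // This inner dot product is where SIMD (e.g. `wide`'s vector types) would go.
                row.iter().zip(x).map(|(w, xi)| w * xi).sum()
            })
            .collect()
    }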

So yeah, I think there is a lot of room for optimization, and the only reason one would use this today is if they want to have a "simple" implementation that doesn't have any C/C++ dependencies for build tooling reasons.


Your point is valid when it comes to rayon (I don't know much about wide) being inherently slower than custom optimization, but from what I've seen I suspect rayon isn't even the bottleneck; there's a decent margin of improvement (I'd expect at least double the throughput) without even doing anything arcane.

This is beautifully written, thanks for sharing.

I could see myself using some of the source code in the classroom to explain how transformers "really" work; code is more concrete/detailed than all those pictures of attention heads etc.

Two points of minor criticism/suggestions for improvement:

- libraries should not print to stdout, as that output may destroy application output (imagine I want to use the library in a text editor to offer style checking). So it's best to write to a string buffer owned by a logging class instance associated with an lm.rs object.

- Is it possible to do all this without "unsafe" without twisting one's arm? I see there are uses of "unsafe" e.g. to force data alignment in the model reader.

Again, thanks and very impressive!


> best to write to a string buffer

It's best to call a user callback. That way logs can be, for example, displayed in a GUI.


A good logging framework has all the hooks you need

Doesn't rust have a standard solution for that?

If I use 10 libraries and they all use a different logging framework then that's ... not convenient.


It does, everyone uses the `log` crate. But then it wouldn't be zero-dependencies anymore.
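
For what it's worth, the library side of the facade is tiny (a sketch assuming the `log` crate; the application then installs whatever backend it wants, e.g. env_logger or a GUI sink):

    use log::{debug, info};

    pub fn generate(prompt: &str) {
        info!("starting generation for a {}-byte prompt", prompt.len());
        // ... run inference ...
        debug!("generation finished");
    }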

In fairness it's already not really "zero dependency" since it uses rayon (for easy multithreading) and wide (for easy SIMD); using log would make total sense I think (I'm not the main author, just a contributor).

Neat.

FYI I have a whole bunch of rust tools[0] for loading models and other LLM tasks. For example auto selecting the largest quant based on memory available, extracting a tokenizer from a gguf, prompting, etc. You could use this to remove some of the python dependencies you have.

Currently it's built to support llama.cpp, but this is pretty neat too. Any plans to support grammars?

[0] https://github.com/ShelbyJenkins/llm_client


The title is less clear than it could be IMO.

When I saw "no dependency" I thought maybe it could be no_std (llama.c is relatively lightweight in this regard). But it's definitely not `no_std` and in fact seems like it has several dependencies. Perhaps all of them are rust dependencies?


The readme seems to indicate that it expects PyTorch alongside several other Python dependencies in a requirements.txt file (which is the only place I can find any form of the word "dependency" on the page). I'm very confused by the characterization in the title here given that it doesn't seem to be claimed at all by the project itself (which simply has the subtitle "Minimal LLM inference in Rust").

From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author. If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.


> The readme seems to indicate that it expects pytorch alongside several other Python dependencies in a requirements.txt file

That's only if you want to convert the model yourself; you don't need that if you use the converted weights on the author's Hugging Face page (in the “prepared-models” table of the README).

> From the git history, it looks like the username of the person who posted this here is someone who has contributed to the project but isn't the primary author.

Yup, that's correct; so far I've only authored the Dioxus GUI app.

> If they could elaborate on what exactly they mean by saying this has "zero dependencies", that might be helpful.

See my other response: https://news.ycombinator.com/item?id=41812665


What do you think about implementing your gui for other rust LLM projects? I’m looking for a front end for my project: https://github.com/ShelbyJenkins/llm_client

The original may have made sense, e.g. "no hardware dependency" or "no GPU dependency". Unfortunately HN deletes words from titles with no rhyme or reason, and no transparency.

Titles are hard.

What I wanted to express is that it doesn't have any pytorch or Cuda or onnx or whatever deep learning dependency and that all the logic is self contained.

To be totally transparent, it has 5 Rust dependencies by default; two of them should be feature-gated for the chat (chrono and clap), and then there are 3 utility crates that are used to get a little more performance out of the hardware (`rayon` for easier parallelization, `wide` for helping with SIMD, and `memmap2` for memory-mapping the model file).


Yeah, hard to not be overly verbose. “No massive dependencies with long build times and deep abstractions!” is not as catchy.

No dependencies in this case (and pretty much any Rust project) means: to build you need rustc+cargo, and to use it you just need the resulting binary.

As in, you don't need to have a C compiler, Python, or dynamic libraries. "Pure Rust" would be a better way to describe it.


It's a little bit more than pure Rust: to build the library there are basically only two dependencies (rayon and wide), which bring only 14 transitive dependencies (anyone who's built even a simple Rust program knows that this is a very small number).

And there's more: rayon and wide are only needed for performance. We could trivially put them behind a feature flag, have zero dependencies, and actually have the library work in a no-std context, but it would be so slow it would have no use at all, so I don't really think that makes sense to do except to win an argument…
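
The feature gating would look something like this (a sketch with a hypothetical `parallel` feature, not code from the repo):

    // With the (hypothetical) `parallel` feature enabled, fan the work out via rayon...
    #[cfg(feature = "parallel")]
    fn for_each_row(n: usize, f: impl Fn(usize) + Send + Sync) {
        use rayon::prelude::*;
        (0..n).into_par_iter().for_each(|i| f(i));
    }

    // ...otherwise fall back to a plain sequential loop, with no dependencies at all.
    #[cfg(not(feature = "parallel"))]
    fn for_each_row(n: usize, f: impl Fn(usize) + Send + Sync) {
        (0..n).for_each(f);
    }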


Is Rust's cargo basically like npm at this point? Like how on earth does sixteen dependencies mean no dependencies lol

Yes, basically. If you're a dependency maximalist (never write any code that can be replaced by a dependency), you can easily end up with a thousand dependencies. I don't like things being that way, but others do.

It's worth noting that Rust's std library is really small, and you therefore need more dependencies in Rust than in some other languages like Python. There are some "blessed" crates though, like the ones maintained by the rust-lang team themselves (https://crates.io/teams/github:rust-lang:libs and https://crates.io/teams/github:rust-lang-nursery:libs). Also, when you add a dependency like Tokio, Axum, or Polars, these are often ecosystems of crates rather than singular crates.

Tl;dr: Good package managers end up encouraging micro-dependencies and dependency bloat because these things are now painless. Cargo is one of these good package managers.


How about designing a "proper" standard library for Rust (comparable to Java's or Common Lisp's), to ensure a richer experience, avoid dependency explosions, and also ensure things are written in a uniform interface style? Is that something the Rust folks are considering or actively working on?

EDIT: nobody is helped by 46 regex libraries, none of which implements Unicode fully, for example (not an example taken from the Rust community).


The particular mode of distribution of code as a traditional standard library has downsides:

- it's inevitably going to accumulate mistakes/obsolete/deprecated stuff over time, because there can be only one version of it, and it needs to be backwards compatible.

- it makes porting the language to new platforms harder, since there's more stuff promised to work as standard.

- to reduce the risk of the above problems, a stdlib usually sticks to basic lowest-common-denominator APIs, lagging behind the state of the art and creating a dilemma between using the standard impl vs. better 3rd-party impls (and large programs end up with both)

- with a one-size-fits-all approach, it's easy to add bloat from unnecessary features. Not all programs want to embed megabytes of Unicode metadata for a regex.

The goal of having common trustworthy code can be achieved in many other ways, such as having (de-facto) standard individual dependencies to choose from. Packages that aren't built-in can be versioned independently, and included only when necessary.


Just use the rust-lang org's regex crate. It's fascinating that you managed to pick one of like 3 high-level use-cases that are covered by official rust-lang crates.

Indeed. It's the one cultural aspect of Rust I find exhausting. Huge fan of the language and the community in general, but a few widespread attitudes do drive me nuts:

* That adding dependencies is something you should take very lightly

* That everybody uses or should use crates.io for dependencies

* That it's OK to just ask users to use the latest release of something at all times

* That vendoring code is always a good thing when it adds even the slightest convenience

* That one should ship generated code (prominent in e.g. crates that use FFI bindings)

* The idea that as long as software doesn't depend on something non-Rust, it doesn't have dependencies

Luckily the language, the standard library and the community in general are of excellent quality.


> like how on earth is sixteen dependencies means no dependencies lol

You're counting optional dependencies used in the binaries, which isn't fair (obviously the GUI app or the backend of the webui are going to have dependencies!). But yes, 3 dependencies isn't literally zero dependencies.


Great! Did something similar some time ago [0] but the performance was underwhelming compared to C/C++ code running on CPU (which points to my lack of understanding of how to make Rust fast). Would be nice to have some benchmarks of the different Rust implementations.

Implementing LLM inference should/could really become the new "hello world!" for serious programmers out there :)

[0] https://github.com/gip/yllama.rs


I also had a similar 'hello world' experience some time ago with [0] :). I manually used some SIMD instructions, and it seems the performance could align with llama.cpp. It appears that the key to performance is:

1. using SIMD on quantized matrix multiplication

2. using a busy loop instead of condition variables when splitting work among threads (rough sketch below).

(However, I haven't had more free time to continue working on inference of quantized models on GPU (with Vulkan), and it hasn't been updated in a long time since then.)

[0] https://github.com/crabml/crabml
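
To illustrate point 2, the idea is roughly this (a sketch, not crabml's actual code): worker threads spin on an atomic generation counter instead of parking on a condvar, which avoids wake-up latency between the many small matmuls per token.

    use std::sync::atomic::{AtomicUsize, Ordering};

    pub struct SpinBarrier {
        generation: AtomicUsize,
    }

    impl SpinBarrier {
        // Worker side: busy-wait until the coordinator publishes generation `gen`.
        pub fn wait_for(&self, gen: usize) {
            while self.generation.load(Ordering::Acquire) < gen {
                std::hint::spin_loop(); // hint to the CPU that we're spinning
            }
        }

        // Coordinator side: publish the next round of work.
        pub fn advance(&self) {
            self.generation.fetch_add(1, Ordering::Release);
        }
    }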


Correct me if I am wrong, but these implementations are all CPU-bound? I.e., if I have a good GPU, I should look for alternatives.

You are correct. This project is "on the CPU", so it will not utilize your GPU for computation. If you would like to try out a Rust framework that does support GPUs, Candle https://github.com/huggingface/candle/tree/main may be worth exploring.

CPU, yes, but more importantly memory bandwidth.

An RTX 3090 (as one example) has nearly 1TB/s of memory bandwidth. You'd need at least 12 channels of the fastest proof-of-concept DDR5 on the planet to equal that.

If you have a discrete GPU, use an implementation that utilizes it because it's a completely different story.

Apple Silicon boasts impressive numbers on LLM inference because it has a unified CPU-GPU high-bandwidth (400GB/s IIRC) memory architecture.
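
As a rough back-of-the-envelope (my numbers, ballpark only): generating each token requires reading essentially all of the weights once, so tokens/s is bounded by roughly memory bandwidth divided by model size. For the 1.2 GB Q8 model above, ~100 GB/s of effective CPU bandwidth caps you at around 80 tok/s, while ~1 TB/s on a discrete GPU pushes the same ceiling toward 800 tok/s.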


Depends. Good models are big, and require a lot of memory. Even the 4090 doesn't have that much memory in an LLM context. So your GPU will be faster, but likely can't fit the big models.

Yes. Depending on the GPU, a 10-20x difference.

For Rust you have the llama.cpp wrappers like llm_client (mine), and the Candle-based projects mistral.rs and Kalosm.

Although my project does try to provide a mistral.rs implementation, I haven't fully migrated from llama.cpp. A full Rust implementation would be nice for quick install times (among other reasons). Right now my crate has to clone and build. It's automated for Mac, PC, and Linux, but it adds about a minute of build time.


It's all implemented on the CPU, yes, there's no GPU acceleration whatsoever (at the moment at least).

> if I have a good GPU, I should look for alternatives.

If you actually want to run it, even just on the CPU, you should look for an alternative (and the alternative is called llama.cpp); this is more of an educational resource about how things work when you remove all the layers of complexity in the ecosystem.

LLMs are somewhat magic in how effective they can be, but in terms of code it's really simple.


This is cool (and congrats on writing your first Rust lib!), but Metal/Cuda support is a must for serious local usage.

Using CUDA is a non-starter because it would go against the purpose of this project, but I (not the main author, but a contributor) am experimenting with wgpu to get some kind of GPU acceleration.

I'm not sure it will go anywhere though, because the main author wants to keep the complexity under control.


wgpu would be awesome. Too little ML software out there is hardware-agnostic.

That's exactly my feeling and that's why I started working on it.

What's the value of this compared to llama.cpp?

Easier to integrate with other Rust projects maybe?

Cleaner codebase because of fewer features!

Interesting. I appreciate the Rust community's enthusiasm for rewriting most of this stuff.

This is really cool.

It's already using Dioxus (neat). I wonder if WASM could be put on the roadmap.

If this could run a lightweight LLM like RWKV in the browser, then the browser unlocks a whole class of new capabilities without calling any SaaS APIs.


I was poking at this a bit here

https://github.com/maedoc/rwkv.js

using rwkv.cpp compiled with Emscripten, but I didn't quite figure out the tokenizer part (yet, only spent about an hour on it)

Nevertheless, I am pretty sure the 1.6B RWKV-6 would be totally usable offline, browser-only. It's not capable enough for general chat, but for RAG etc. it could be quite enough.


> I wonder if WASM could be put on the roadmap.

The library itself should be able to compile to WASM with very little change: rayon and wide, the only mandatory dependencies, support WASM out of the box, and you can get rid of memmap2 by replacing the `Mmap` type in transformer.rs with `&[u8]`.

That being said, RWKV is a completely different architecture, so it would need to be reimplemented entirely and is not likely to ever be part of the roadmap (I'm not the main author so I can't say for sure, but I really doubt it).



Much simpler codebase because it has far fewer features. It doesn't aim to be a llama.cpp competitor AFAIK.

Nice work, it would be great to see some benchmarks comparing it to llm.c.

I doubt it would compare favorably at the moment, I don't think it's particularly well optimized besides using rayon to get CPU parallelism and wide for a bit of SIMD.

It's good enough to get pretty good performance for little effort, but I don't think it would win a benchmark race either.


Would love to see a wasm version of this!

Quite curious to hear: why?

Asking because this program isn't useful without 3G of model data, and WASM isn't useful outside of the browser (and perhaps some blockchain applications), where 3G of data isn't going to be practically available.


Another llama.cpp or mistral.rs? If it supports vision models, then fine, I will try it.

EDIT: Looks like no L3.2 11B yet.


It has supported the Phi 3.5 vision model since yesterday, actually.

I think an 11B model would be way too slow in its current shape though.


Such a talented guy!


