Exactly. How can "we" develop and encourage benchmarks for multi-turn user assistance?
That is what I want.
I feel like the models and harnesses push much too hard against this workflow -- that they push you towards letting go and vibe coding, with only your discipline (and desire for a quality and maintainable product) holding it back.
As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not due to the quantization. However, in this specific case, I use Q8 on the mac and Q4 on the RTX machine because the latter does not fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.
27B seems surprisingly resilient to quantisation. Though my evals showed some impact on coding ability going from 8-bit to 4-bit, it was less than I would've expected, and it was on task types that you've said above you don't really do with these!
Yes. I think one of the big advantages of SoA is that you only pay for the fields you're currently using.
If you need a field somewhere, you can add it and only pay the cost of iterating it where you need it.
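To make the SoA point concrete, here is a minimal sketch. The `ParticlesSoA` class, its field names, and `sum_x` are made up for illustration, and Python only shows the layout idea; the cache benefit is much more pronounced in something like C++ or Julia.

```python
from array import array


class ParticlesSoA:
    """Struct-of-arrays: one contiguous array per field (hypothetical example)."""

    def __init__(self, xs, masses):
        self.x = array("d", xs)
        self.mass = array("d", masses)

    def add_field(self, name, values):
        # Adding a field later creates a new column; the existing
        # columns and the loops over them are left untouched.
        column = array("d", values)
        if len(column) != len(self.x):
            raise ValueError("new field must have one value per element")
        setattr(self, name, column)


def sum_x(particles):
    # Only the `x` column is read here; `mass` and any later fields
    # are never touched by this loop.
    return sum(particles.x)
```

A loop like `sum_x` keeps the same memory traffic no matter how many columns get added with `add_field` afterwards, whereas in an array-of-structs layout every extra field widens every record that the loop walks over.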
foot also offers a client/server architecture.
If you start a foot server (e.g. with a systemd service), you can use `footclient -N`.
This may reduce the memory pressure of running many terminals.
This is similar to the `kitty --singleinstance` mentioned in another comment by amarshall.
I was once a bit of a Julia performance expert, but moved toward C++ for hobby projects even while still using Julia professionally.
I wrote a blog post at the time with exactly that punchline (not explicitly stated, but just look at the code!):
https://spmd.org/posts/multithreadedallocations/
The example was similar to a real production-critical hot path from work.
Maybe things changed since I left Julia, but that was December 2023, for years after this blog post.
I'm still working on it.
I'm currently working on a cache tile-size optimization algorithm that should (a) handle trees (a set of loops can be merged at some cache levels and split at others, e.g. in an MLP it may carry an output through the L3 cache, while doing sub-operations in the L2/L1/registers) (b) converge reasonably quickly so compile times are acceptable.
This is the last step before I move to code generation and then generating a ton of test cases/debugging.
My goal is some form of release by the end of the year.
Yeah, for now.
I'd like it to be open, but I also want to potentially be able to make money/a living off of it.
My dream would be that it can be open while hardware vendors pay me to optimize for their hardware.
For now, being closed gives me more options. It's a lot easier to open in the future than to close, so it's just keeping options open.
I've thought a lot more about the engineering than any sort of marketing or business plan, so I just want to defer those.
I'm just messing around with building agents, that's all. I'm not super interested in making ones that just sit in a terminal executing shell scripts because, truth be told, they're absolutely trivial to make and don't show any interesting parts of LLMs. Telling an agent that it is sitting in Kakoune is a whole lot more interesting and really shows a lot of what LLMs aren't great at: they have to fight their urge to spit out overwrought bash invocations, or at the very least find a way to fit those into something new.
So far the only tools the agent has access to are `evaluate_commands(commands=["...", "..."])` and `get_buffer_contents()`, which really makes them have to work for doing things. I could make it super easy for them but then it wouldn't be an interesting experiment.
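For anyone curious what a two-tool surface like that could look like, here is a rough sketch. Only the tool names `evaluate_commands` and `get_buffer_contents` come from the description above; `KakouneTools`, `dispatch`, and the injected `send` and `read_buffer` callables are hypothetical stand-ins, and the actual transport (the session FIFO and its binary format) is deliberately not shown.

```python
from typing import Callable, Dict, List


class KakouneTools:
    """Hypothetical tool layer; the transport to Kakoune is injected."""

    def __init__(self, send: Callable[[str], None], read_buffer: Callable[[], str]):
        self._send = send                # assumed: delivers one command to the session
        self._read_buffer = read_buffer  # assumed: returns the current buffer text

    def evaluate_commands(self, commands: List[str]) -> None:
        # Forward each command in order; nothing is rewritten or validated.
        for command in commands:
            self._send(command)

    def get_buffer_contents(self) -> str:
        return self._read_buffer()


def dispatch(tools: KakouneTools, name: str, arguments: Dict) -> str:
    # Route a tool call coming from the model to the matching method.
    if name == "evaluate_commands":
        tools.evaluate_commands(arguments["commands"])
        return "ok"
    if name == "get_buffer_contents":
        return tools.get_buffer_contents()
    raise ValueError(f"unknown tool: {name}")
```

Keeping the transport behind two callables means the agent loop can be exercised against a fake editor before it is pointed at a real session.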
If I were to try to make something more useful out of this, I'd probably add the ability for LLMs to list buffers, probably give them an easier out for executing shell scripts in the way they prefer, give them an easier time to list docs and a few other things like that.
The tools and the interaction with Kakoune are really trivial to write; I already use this by having the agent write to the session FIFO (a very simple binary format) and I extract information via my own FIFO that Kakoune writes to (this is used for the buffer data only right now).
I think once you started using it more as a tool and not a pseudo-benchmark like I am, you'd probably think of even more things to add, but a lot of it comes down to just making Kakoune's state visible and making shell spam (which the LLMs love) easier.
In my experience, LLMs don't reason well about expected states, contracts, invariants, etc.
Partly because they don't have long-term memory and are often forced to reason about code in isolation.
Maybe this means all invariants should go into AGENTS.md/CLAUDE.md files, or into doc strings so a new human reader will quickly understand assumptions.
Regardless, I think a habit of putting contracts to make pre- and post-conditions clear could help an AI reason about code.
Maybe instead of suggesting a patch to cover up a symptom, an AI may reason that a post-condition somewhere was violated, and will dig towards the root cause.
This applies just as well to asserts, too.
Contracts/asserts actually need to be added to tell a reader something.
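As a small illustration of what I mean, here is a sketch using plain asserts as pre- and post-conditions. `merge_sorted` and `is_sorted` are made-up examples, not from any particular codebase.

```python
def is_sorted(values):
    # True when each element is <= the one after it.
    return all(a <= b for a, b in zip(values, values[1:]))


def merge_sorted(left, right):
    # Preconditions: tell the reader (human or LLM) what callers must guarantee.
    assert is_sorted(left), "precondition: left must be sorted"
    assert is_sorted(right), "precondition: right must be sorted"

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])

    # Postconditions: what every caller may rely on afterwards.
    assert len(merged) == len(left) + len(right), "postcondition: no element lost"
    assert is_sorted(merged), "postcondition: result is sorted"
    return merged
```

If a caller passes unsorted input, the failure names the violated precondition at the boundary instead of surfacing later as a mysteriously unsorted result, which is the kind of hint that points toward the root cause rather than a patch over the symptom. Note that Python drops `assert` statements when run with `-O`, so these act as documentation and debug checks, not input validation.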