More

mark_l_watson · 2026-03-13T18:20:23 1773426023

I have spent a HUGE amount of time the last two years experimenting with local models.

A few lessons learned:

1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.

2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...

Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lot's of efforts trying to maximize local only results. I don't recommend it for others.

I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.

sdrinf · 2026-03-13T20:35:38 1773434138

Just want to echo the recommendation for qwen3.5:9b. This is a smol, thinking, agentic tool-using, text-image multimodal creature, with very good internal chains of thought. CoT can be sometimes excessive, but it leads to very stable decision-making process, even across very large contexts -something we haven't seen models of this size before.

What's also new here, is VRAM-context size trade-off: for 25% of it's attention network, they use the regular KV cache for global coherency, but for 75% they use a new KV cache with linear(!!!!) memory-token-context size expansion! which means, eg ~100K token -> 1.5gb VRAM use -meaning for the first time you can do extremely long conversations / document processing with eg a 3060.

Strong, strong recommend.

steve_adams_86 · 2026-03-13T21:06:23 1773435983

I've been building a harness for qwen3.5:9b lately (to better understand how to create agentic tools/have fun) and I'm not going to use it instead of Opus 4.6 for my day job but it's remarkably useful for small tasks. And more than snappy enough on my equipment. It's a fun model to experiment with. I was previously using an old model from Meta and the contrast in capability is pretty crazy.

I like the idea of finding practical uses for it, but so far haven't managed to be creative enough. I'm so accustomed to using these things for programming.

tempoponet · 2026-03-13T23:18:16 1773443896

What kind of small tasks do you find it's good at? My non-coding use of agents has been related to server admin, and my local-llm use-case is for 24/7 tasks that would be cost-prohibitive. So my best guess for this would be monitoring logs, security cameras, and general home automation tasks.

steve_adams_86 · 2026-03-14T01:04:43 1773450283

That's about it. The harness is still pretty rudimentary so I'm sure the system could be more capable, and that might reveal more interesting opportunities. I don't really know.

So far I've got it orchestrating a few instances to dig through logs, local emails, git repositories, and github to figure out what I've been doing and what I need to do. Opus is waayyy better at it, but Qwen does a good enough job to actually be useful.

I tried having it parse orders in emails and create a CSV of expenses, and that went pretty badly. I'm not sure why. The CSV was invalid and full of bunk entries by the end, almost every time. It missed a lot of expenses. It would parse out only 5 or 6 items of 7, for example. Opus and Sonnet do spectacular jobs on tasks like this, and do cool things like create lists of emails with orders then systematically ensure each line item within each email is accounted for, even without prompting to do so. It's an entirely different category of performance.

Automation is something I'd like to dabble in next, but all I can think of it being useful for is mapping commands (probably from voice) to tool calls, and the reality is I'd rather tap a button on my phone. My family might like being able to use voice commands, though. Otherwise, having it parse logs to determine how to act based on thresholds or something would also be far better implemented with simple algorithms. It's hard to find truly useful and clear fits for LLMs

novok · 2026-03-14T03:27:21 1773458841

Oh man you just gave me an idea to use something like qwen 3.5 to categorize a lot of emails. You can keep the context small, do it per email and just churn through a lot of crap.

threecheese · 2026-03-13T22:24:54 1773440694

You can really see the limitations of qwen3.5:9b in reasoning traces- it’s fascinating. When a question “goes bad”, sometimes the thinking tokens are WILD - it’s like watching the Poirot after a head injury.

Example: “what is the air speed velocity of a swallow?” - qwen knew it was a Monty Python gag, but couldnt and didnt figure out which one.

scottmf · 2026-03-14T01:23:35 1773451415

As a person who also knows there's a connection between that phrase and Monty Python and not much more information beyond that, I'm not sure how to feel.

cassianoleal · 2026-03-14T03:11:48 1773457908

African or European?

kingo55 · 2026-03-13T21:08:50 1773436130

How's it compare in quality with larger models in the same series? E.g 122b?

ggsp · 2026-03-13T21:34:32 1773437672

How much difference are you seeing between standard and Q4 versions in terms of degradation, and is it constant across tasks or more noticeable in some vs others?

rnewme · 2026-03-13T21:39:02 1773437942

Less than expected, search for unsloths recent benchmark

dsr_ · 2026-03-13T22:10:04 1773439804

Correction: not thinking, not a creature.

If it was a creature I would feel some sorrow when I killed it.

If you are feeling sorrow when you reboot a machine running an LLM, get to a psychiatrist ASAP.

scronkfinkle · 2026-03-14T00:45:16 1773449116

Do you also require computers to grow legs when they "run"?

"Thinking" is just a term to describe a process in generative AI where you generate additional tokens in a manner similar to thinking a problem through. It's kind of a tired point to argue against the verb since it's meaning is well understood at this point

dsr_ · 2026-03-14T01:19:40 1773451180

I am a professional in the information technology field, which is to say a pedantic extremist who believes that words have meanings derived from consensus, and when people alter the meanings, they alter what they believe.

Using "thinking", "feeling", "alive", or otherwise referring to a current generation LLM as a creature is a mistake which encourages being wrong in further thinking about them.

econ · 2026-03-14T02:07:40 1773454060

We lack much vocabulary in this new situation. Not that I have words for it but to paint the picture: if I hang out with people sharing some quality I tend to assume it's there in others and treat them as such. LLMs might not be people, I doubt our subconscious knows the difference.

There is this ancient story where man was created to mine gold in SA. There was some disagreement whether or not to delete the creatures afterwards. The jury is still out on what the point is.

Consulting our feelings seems good, the feelings were trained on millions of years worth of interactions. Non of them were this tho.

What would be the point for you of uhh robotmancipation?

Edit: for me it would get complicated if it starts screaming and begging not to be deleted. Which I know makes no sense.

twelvedogs · 2026-03-14T03:24:21 1773458661

think you're on the wrong side of the consensus here

colechristensen · 2026-03-14T01:34:28 1773452068

I'd suggest spending more time studying words to relive your extremism. The meanings of words move incredibly quickly and a tremendous number of words have little to no relation to previous meanings.

Words such as nice, terrific, awful, manufacture, naughty, decimate, artificial, bully... and on and on.

AnimalMuppet · 2026-03-14T01:38:27 1773452307

When people alter the meanings, you need to start using different words to describe what you believe.

woctordho · 2026-03-14T03:23:01 1773458581

Then don't get sorrow killing it. Living things are not so special.

peddling-brink · 2026-03-13T23:46:02 1773445562

Rebooting a machine running an LLM isn’t noticed by the LLM.

Would you feel comfortable digitally torturing it? Giving it a persona and telling it terrible things? Acts of violence against its persona?

I’m not confident it’s not “feeling” in a way.

Yes its circuitry is ones and zeros, we understand the mechanics. But at some point, there’s mechanics and meat circuitry behind our thoughts and feelings too.

It is hubris to confidently state that this is not a form of consciousness.

colechristensen · 2026-03-14T01:38:50 1773452330

I'm not entirely opposed to the kind of animism that assigns a certain amount of soul, consciousness, or being to everything in a spectrum between a rock and a philosopher... but even so.

Multiplying large matrices over and over is very much towards the "rock" end of that scale.

hnfong · 2026-03-14T01:58:38 1773453518

If we accept the Church-Turing thesis, a philosopher can be simulated by a simple Universal Turing machine.

If one day we are able to create a philosopher from such a rudimentary machine (and a lot of tape), would you consider that very much towards the "rock" end as well?

fragmede · 2026-03-14T00:05:47 1773446747

What do you imagine the psychiatrist will do? That's an incredibly dismissive take.

dsr_ · 2026-03-14T01:13:16 1773450796

Accept it in the spirit it was meant: if you have mental illnesses like this, you need treatment.

998244353 · 2026-03-14T01:37:59 1773452279

Ok but no one here actually implied that they think like this.

johnmaguire · 2026-03-13T18:54:47 1773428087

I'd love to know how you fit smaller models into your workflow. I have an M4 Macbook Pro w/ 128GB RAM and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.

aneyadeng · 2026-03-14T04:21:14 1773462074

Here's a workflow that works well for me with local models on a Mac:

For code review/quality tasks, I use smaller models (7-14B) as a first pass - they're surprisingly good at catching common AI-generated code issues like hallucinated package imports, deprecated API usage, and context drift. I pipe git diffs through Ollama's API and get structured JSON output back.

For anything that needs real reasoning, I fall back to a cloud model. The key insight is that most "AI code review" tasks are actually pattern matching, not reasoning - and small models excel at that.

The setup: Ollama for serving + a thin Node.js wrapper that handles batching and output parsing. Runs as a local CI check. For your 128GB Mac, you could run the detection model AND a larger model simultaneously without any VRAM issues.

If you're interested, I built an open-source tool for exactly this: AI-generated code defect detection in CI pipelines. Happy to share details.

philipkglass · 2026-03-13T19:00:36 1773428436

It really depends on the tasks you have to perform. I am using specialized OCR models running locally to extract page layout information and text from scanned legal documents. The quality isn't perfect, but it is really good compared to desktop/server OCR software that I formerly used that cost hundreds or thousands of dollars for a license. If you have similar needs and the time to try just one model, start with GLM-OCR.

If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be frustrating if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to interpreting and transforming unstructured data.

mandeepj · 2026-03-13T20:58:11 1773435491

> I formerly used that cost hundreds or thousands of dollars for a license

Azure Doc Intelligence charges $1.50 for 1000 pages. Was that an annual/recurring license?

Would you mind sharing your OCR model? I'm using Azure for now, as I want to focus on building the functionality first, but would later opt for a local model.

philipkglass · 2026-03-13T21:18:42 1773436722

I took a long break from document processing after working on it heavily 20 years ago. The tools I used before were ABBYY FineReader and PrimeOCR. I haven't tried any of the commercial cloud based solutions. I'm currently using GLM-OCR, Chandra OCR, and Apple's LiveText in conjunction with each other (plus custom code for glue functionality and downstream processing).

Try just GLM-OCR if you want to get started quickly. It has good layout recognition quality, good text recognition quality, and they actually tested it on Apple Silicon laptops. It works easily out-of-the-box without the yak shaving I encountered with some other models. Chandra is even more accurate on text but its layout bounding boxes are worse and it runs very slowly unless you can set up batched inference with vLLM on CUDA. (I tried to get batching to run with vllm-mlx so it could work entirely on macOS, but a day spent shaving the yak with Claude Opus's help went nowhere.)

If you just want to transcribe documents, you can also try end-to-end models like olmOCR 2. I need pipeline models that expose inner details of document layout because I need to segment and restructure page contents for further processing. The end-to-end models just "magically" turn page scans into complete Markdown or HTML documents, which is more convenient for some uses but not mine.

D-Machine · 2026-03-13T21:22:57 1773436977

These are some really great explicit examples and links, much appreciated.

naasking · 2026-03-14T00:58:43 1773449923

How does GLM-OCR compare to Qwen 3 VL? I've had good experiences with Qwen for these purposes.

philipkglass · 2026-03-14T02:39:41 1773455981

Qwen 3 and 3.5 models are quite capable. Perhaps the greatest benefit of GLM-OCR is speed: it's only a 0.9 billion parameter model, so it's fast enough to run on large volumes of complicated scans even if all you have for inference is an entry level MacBook or a low end Nvidia card. Even CPU based inference on basic laptops is probably tolerable with it for small page volumes.

tempaccount5050 · 2026-03-13T22:29:02 1773440942

Not OP but I had an XML file with inconsistent formatting for album releases. I wanted to extract YouTube links from it, but the formatting was different from album to album. Nothing you could regex or filter manually. I shoved it all into a DB, looked up the album, then gave the xml to a local LLM and said "give me the song/YouTube pairs from this DB entry". Worked like a charm.

Bluecobra · 2026-03-13T19:30:20 1773430220

I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!

lambda · 2026-03-13T20:01:53 1773432113

I've got a 128 GiB unified memory Ryzen Ai Max+ 395 (aka Strix Halo) laptop.

Trying to run LLM models somehow makes 128 GiB of memory feel incredibly tight. I'm frequently getting OOMs when I'm running models that are pushing the limits of what this can fit, I need to leave more memory free for system memory than I was expecting. I was expecting to be able to run models of up to ~100 GiB quantized, leaving 28 GiB for system memory, but it turns out I need to leave more room for context and overhead. ~80 GiB quantized seems like a better max limit when trying not running on a headless system so I'm running a desktop environment, browser, IDE, compilers, etc in addition to the model.

And memory bandwidth limitations for running the models is real! 10B active parameters at 4-6 bit quants feels usable but slow, much more than that and it really starts to feel sluggish.

So this can fit models like Qwen3.5-122B-A10B but it's not the speediest and I had to use a smaller quant than expected. Qwen3-Coder-Next (80B/3B active) feels quite on speed, though not quite as smart. Still trying out models, Nemotron-3-Super-120B-A12B just came out, but looks like it'll be a bit slower than Qwen3.5 while not offering up any more performance, though I do really like that they have been transparent in releasing most of its training data.

zozbot234 · 2026-03-13T20:30:36 1773433836

There's been some very recent ongoing work in some local AI frameworks on enabling mmap by default, which can potentially obviate some RAM-driven limitations especially for sparse MoE models. Running with mmap and too little RAM will then still come with severe slowdowns since read-only model parameters will have to be shuttled in from storage as they're needed, but for hardware with fast enough storage and especially for models that "almost" fit in the RAM filesystem cache, this can be a huge unblock at negligible cost. Especially if it potentially enables further unblocks via adding extra swap for K-V cache and long context.

AzN1337c0d3r · 2026-03-13T19:35:19 1773430519

Most workstation class laptops (i.e. Lenovo P-series, Dell Precision) have 4 DIMM slots and you can get them with 256 GB (at least, before the current RAM shortages).

There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.

Only Apple has the unique dynamic allocation though.

numpad0 · 2026-03-14T04:02:52 1773460972

Intel had dynamic allocation since Intel 830(2001) for Pentium III Mobile. Everything always did, especially platforms with iGPUs like Xbox 360.

Only Apple and AMD have APUs with relatively fast iGPU that becomes relevant in large local LLM(>7b) use cases.

the_pwner224 · 2026-03-13T19:50:22 1773431422

Yep, I have a 13" gaming tablet with the 128 GB AMD Strix Halo chip (Ryzen AI Max+ 395, what a name). Asus ROG Flow Z13. It's a beast; the performance is totally disproportionate to its size & form factor.

I'm not sure what exactly you're referring to with "Only Apple has the unique dynamic allocation though." On Strix Halo you set the fixed VRAM size to 512 MB in the BIOS, and you set a few Linux kernel params that enable dynamic allocation to whatever limit you want (I'm using 110 GB max at the moment). LLMs can use up to that much when loaded, but it's shared fully dynamically with regular RAM and is instantly available for regular system use when you unload the LLM.

wilkystyle · 2026-03-13T20:59:16 1773435556

What operating system are you using? I was looking at this exact machine as a potential next upgrade.

the_pwner224 · 2026-03-13T21:10:01 1773436201

Arch with KDE, it works perfectly out of the box.

I configured/disabled RGB lighting in Windows before wiping and the settings carried over to Linux. On Arch, install & enable power-profiles-daemon and you can switch between quiet/balanced/performance fan & TDP profiles. It uses the same profiles & fan curves as the options in Asus's Windows software. KDE has native integration for this in the GUI in the battery menu. You don't need to install asus-linux or rog-control-center.

For local AI: set VRAM size to 512 MB in the BIOS, add these kernel params:

ttm.pages_limit=31457280 ttm.page_pool_size=31457280 amd_iommu=off

Pages are 4 KiB each, so 120 GiB = 120 x 1024^3 / 4096 = 31457280

To check that it worked: sudo dmesg | grep "amdgpu.*memory" will report two values. VRAM is what's set in BIOS (minimum static allocation). GTT is the maximum dynamic quota. The default is 48 GB of GTT. So if you're running small models you actually don't even need to do anything, it'll just work out of the box.

LM Studio worked out of the box with no setup, just download the appimage and run it. For Ollama you just `pacman -S ollama-rocm` and `systemctl enable --now ollama`, then it works. I recently got ComfyUI set up to run image gen & 3d gen models and that was also very easy, took <10 minutes.

I can't believe this machine is still going for $2,800 with 128 GB. It's an incredible value.

wilkystyle · 2026-03-14T01:13:31 1773450811

Really appreciate this response! Glad to hear you are running Arch and liking it.

I've been a long-time Apple user (and long-time user of Linux for work + part-time for personal), but have been trying out Arch and hyprland on my decade+ old ThinkPad and have been surprised at how enjoyable the experience is. I'm thinking it might just be the tipping point for leaving Apple.

xnzakg · 2026-03-13T21:44:10 1773438250

You may wanna see if openrgb isn't able to configure the RGB. Could even do some fun stuff like changing the color once done with a training run or something

lambda · 2026-03-13T20:06:22 1773432382

> Only Apple has the unique dynamic allocation though.

What do you mean? On Linux I can dynamically allocate memory between CPU and GPU. Just have to set a few kernel parameters to set the max allowable allocation to the GPU, and set the BIOS to the minimum amount of dedicated graphics memory.

AzN1337c0d3r · 2026-03-13T20:29:11 1773433751

Maybe things have changed but the last time I looked at this, it was only max 96GB to the GPU. And it isn't dynamic in the sense you still have to tweak the kernel parameters, which require a reboot.

Apple has none of this.

the_pwner224 · 2026-03-13T20:37:29 1773434249

Strix Halo you can get at least 120 GB to the GPU (out of 128 GB total), I'm using this configuration.

Setting the kernel params is a one-time initial setup thing. You have 128 GB of RAM, set it to 120 or whatever as the max VRAM. The LLM will use as much as it needs and the rest of the system will use as much it needs. Fully dynamic with real-time allocation of resources. Honestly I literally haven't even thought of it after setting those kernel args a while ago.

So: "options ttm.pages_limit=31457280 ttm.page_pool_size=31457280", reboot, and that's literally all you have to do.

Oh and even that is only needed because the AMD driver defaults it to something like 35-48 GB max VRAM allocation. It is fully dynamic out of the box, you're only configuring the max VRAM quota with those params. I'm not sure why they choice that number for the default.

lambda · 2026-03-13T20:40:57 1773434457

You do have to set the kernel parameters once to set the max GPU allocation, I have it set to 110 GiB, and you have to set a BIOS setting to set the minimum GPU allocation, I have it set to 512 MiB. Once you've set those up, it's dynamic within those constraints, with no more reboots required.

On Windows, I think you're right, it's max 96 GiB to the GPU and it requires a reboot to change it.

saltwounds · 2026-03-13T19:11:21 1773429081

I use Raycast and connect it to LM Studio to run text clean up and summaries often. The models are small enough I keep them in memory more often than not

echelon · 2026-03-13T20:11:18 1773432678

Shouldn't we prioritize large scale open weights and open source cloud infra?

An OpenRunPod with decent usage might encourage more non-leading labs to dump foundation models into the commons. We just need infra to run it. Distilling them down to desktop is a fool's errand. They're meant to run on DC compute.

I'm fine with running everything in the cloud as long as we own the software infra and the weights.

This is conceivably the only way we could catch up to Claude Code is to have the Chinese start releasing their best coding models and for them to get significant traction with companies calling out to hosted versions. Otherwise, we're going to be stuck in a take off scenario with no bridge.

girvo · 2026-03-13T21:19:22 1773436762

I run Qwen3.5-plus through Alibaba’s coding plan (Model Studio): incredibly cheap, pretty fast, and decent. I can’t compare it to the highest released weight one though.

singpolyma3 · 2026-03-13T22:10:56 1773439856

Is that https://www.alibabacloud.com/help/en/model-studio/coding-pla... ? I was a bit confused that it seems to be sized in requests not tokens

girvo · 2026-03-14T03:42:46 1773459766

Yeah that's the one. I've not managed to get close to the limits that the cheapest plan has. Though I did get to sign up at $3 a month which has been neat, too, seems that's gone now

constantinum · 2026-03-14T00:59:30 1773449970

I also want to try Qwen 3.5 plus. I have a doubt, I see almost same pricing for both Qwen and Claude code(the difference being the highest pro plan looks cheaper), and not for the lower plans. Am I missing something, when you say “cheaper” ??

girvo · 2026-03-14T03:41:47 1773459707

I'm using their $3 USD (currently, it will go up in price later I believe - edit: just checked and yeah, so the $10 one) lite plan, and I'm yet to get close to hitting the request limits when I swap to it once I'm out of Claude tokens.

flutetornado · 2026-03-13T23:16:08 1773443768

My experience with qwen3.5 9b has not been the same. It’s definitely good at agentic responses but it hallucinates a lot. 30%-50% of the content it generated for a research task (local code repo exploration) turned out to be plain wrong to the extent of made up file names and function names. I ran its output through KimiK2 and asked it to verify its output - which found out that much of what it had figured out after agentic exploration was plain wrong. So use smaller models but be very cautious how much you depend on their output.

adamkittelson · 2026-03-13T21:49:07 1773438547

Anecdotal but for some reason I had a pretty bad time with qwen3.5 locally for tool usage. I've been using GPT-OSS-120B successfully and switched to qwen so that I could process images as well (I'm using this for a discord chat bot).

Everything worked fine on GPT but Qwen as often as not preferred to pretend to call a tool and not actually call it. After much aggravation I wound up just setting my bot / llama swap to use gpt for chat and only load up qwen when someone posts an image and just process / respond to the image with qwen and pop back over to gpt when the next chat comes in.

GorbachevyChase · 2026-03-13T22:43:42 1773441822

You are responsible for the dead internet theory.

dhblumenfeld1 · 2026-03-13T21:49:39 1773438579

Have you found that using a frontier model for planning and small local model for writing code to be a solid workflow? Been wanting to experiment with relying less on Claude Code/Codex and more on local models.

eek2121 · 2026-03-13T21:59:44 1773439184

Qwen is actually really good at code as well. I used qwen3-coder-next a while back and it was every bit as good as claude code in the use cases I tested it in. Both made the same amount of mistakes, and both did a good job of the rest.

storus · 2026-03-13T23:15:04 1773443704

Coding locally with Qwen3-Coder-Next or Qwen-3.5 is a piece of cake on a workstation card (RTX Pro 6000); set it up in llama.cpp or vLLM in 1 hour, install Claude Code, force local API hostname and fake secret key, and just run it like regular setup with Claude4 but on Qwen.

dataflow · 2026-03-13T21:21:46 1773436906

Thanks for sharing this, it's super helpful. I have a question if you don't mind: I want a model that I can feed, say, my entire email mailbox to, so that I can ask it questions later. (Just the text content, which I can clean and preprocess offline for its use.) Have any offline models you've dealt with seemed suitable for that sort of use case, with that volume of content?

perbu · 2026-03-13T21:41:48 1773438108

Prompt injection is a problem if your agent has access to anything.

The local models are quite weak here.

dataflow · 2026-03-13T21:55:58 1773438958

Security is not a concern for the purpose of my question here, please ignore that for now. I'm just looking for text summary and search functionality here, not looking to give it full system access and let it loose on my computer or network. I can easily set up VM/sandboxing/airgapping/etc. as needed.

My question is really just about what can handle that volume of data (ideally, with the quoted sections/duplications/etc. that come with email chains) and still produce useful (textual) output.

chrisweekly · 2026-03-13T22:51:37 1773442297

Thanks for this, Mark. And for your website and books and generosity of spirit. Signal in the noise. Have an awesome weekend!

sakesun · 2026-03-13T22:29:27 1773440967

Becoming a retired builder is the ultimate bliss.

manmal · 2026-03-13T18:50:40 1773427840

What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?

girvo · 2026-03-13T21:20:06 1773436806

IMO you’re better off using qwen3.5-plus through the model studio coding plan, but ymmv

nine_k · 2026-03-13T18:29:57 1773426597

What kind of hardware did you use? I suppose that a 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.

fzzzy · 2026-03-13T19:40:28 1773430828

A Mac Pro with 512 gb unified ram does not exist.

nine_k · 2026-03-13T19:55:57 1773431757

Mac Studio Ultra, my bad. The 512 GB option existed up until March 2026: https://macdailynews.com/2026/03/06/apple-drops-512gb-m3-ult...

kylehotchkiss · 2026-03-13T18:53:34 1773428014

I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality of capability?

cyanydeez · 2026-03-13T20:53:14 1773435194

Cline (https://marketplace.visualstudio.com/items?itemName=saoudriz...) in vscode, inside a code-server run within docker (https://docs.linuxserver.io/images/docker-code-server/) using lmstudio (https://lmstudio.ai/) to access unsloth models (https://unsloth.ai/docs/get-started/unsloth-model-catalog) speficially (https://unsloth.ai/docs/models/qwen3-coder-next) appears to be right at the edge of productivity, as long as you realize what complexity means when issuing tasks.

mark_l_watson · 2026-03-12T14:39:04 1773326344

If I have time I want to try this today because it matches my LLM-based work style, especially when I am using local models: I have command line tools that help me generated large one-shot prompts that I just paste into an Ollama repl - then I check back in a while.

It looks like Axe works the same way: fire off a request and later look at the results.

jrswab · 2026-03-12T15:13:32 1773328412

Exactly! I also made it to chain them together so each agent only gets what it needs to complete its one specific job.

mark_l_watson · 2026-03-12T11:22:16 1773314536

Lisp languages are niche, but frequently used as seen in the great projects mentioned in this thread. Since 1982, I have been employed about 20% of my time using mostly Common Lisp and for a few years Clojure. Racket is a great language and system for learning and having fun, so, have fun!

mark_l_watson · 2026-03-11T11:44:58 1773229498

I have had an Oculus 2 for many years and while I love it, I rarely spend more than an hour or two a month using it because time in VR competes with activities like walking outside getting fresh air and sun on my face or sitting with my wife or a friend having coffee, or spending time writing a book.

I think we need more wonderful technology that is designed for brief high-value periods of use.

A good example: I get huge value from using AI, but cumulatively I spend perhaps two to three hours a week using Claude or Gemini. Quality products that I appreciate but don't need to spend a lot of time with.

mark_l_watson · 2026-03-11T11:31:23 1773228683

Indeed! Your comment is probably the most important in this thread. The Korean/German philosopher Byung-Chul Han writes a lot about losing humanity because of tech advances.

I am retired so this is easier for me to do: For every hour each day I spend on tech (personal AI research, writing) I spend 90 minutes hiking with friends, playing games like Bridge, enjoying meals with my wife and friends, reading good literature and philosophy, etc.

I worked for 50 years before retiring, but even working, I tried to balance human time vs. tech and work - often leaving 'money on the table' but it was worth it.

Pardon an old man ranting, but I think so many people seem caught up in the wrong things.

mark_l_watson · 2026-03-10T10:27:25 1773138445

I mostly run Emacs in a terminal, except I configure for two finger scroll on Mac trackpad and tap to move cursor. I also reduced the size of my .emacs by 60% in the last year.

mark_l_watson · 2026-03-09T02:52:15 1773024735

That is kind of beautiful. Reading the code in main.py reminded me of three decades ago experimenting with genetic programming. Very cool.

mark_l_watson · 2026-03-08T09:38:05 1772962685

I understand the author’s sentiment but I would like to give a counter example:

I like to read philosophy and after I read a passage and think about it, I find it useful to copy the passage into a decent model and ask for its interpretation, or if it is something old ask about word choice or meaning.

I realize that I may not be getting perfect information, but LLM output gives me ideas that are a combination of live web searching and whatever innate knowledge the LLM holds in its weights.

Another counter example: I have never found runtime error traces from languages like Haskell and Common Lisp to be that clear. If the error is not clear to me, sometimes using a model gets me past an error quickly.

All that said, I think the author is right-on correct that using LLMs should not be an excuse to not think for oneself.

delusional · 2026-03-08T10:35:04 1772966104

> I realize that I may not be getting perfect information, but LLM output gives me ideas that are a combination of live web searching and whatever innate knowledge the LLM holds in its weights.

I don't mean to be judgemental. It's possible this is a personal observation, but I do wonder if it's not universal. I find that if I give an inch to these models thinking, I instantly become lazy. It doesn't really matter if they produce interesting output, but rather that I stop trying to produce interesting thoughts because I can't help wonder if the LLM wouldn't have output the same thing. I become TOO output focused. I mistake reading an interpretation for actually integrating knowledge into my own thinking, I disregard following along with the author.

I love reading philosophy as well. Dialectic of Enlightenment profoundy shaped how I view the world, but there was not a single part of that book that I could have given you a coherent interpretation of as a read it. The interpretations all come now, years after I read it. I can't help but wonder if those interpretations would have been different, had my subcouncious been satiated by cheap explanations from the lie robot.

demorro · 2026-03-08T12:22:48 1772972568

Seconding this. Revelation happens subtly, often far removed from what you might later unpick as its "primary source". Immediate interpretation tend to be plastic and shallow.

sph · 2026-03-08T15:08:08 1772982488

Also it might be hard to grasp for most of us, used to constant stimulation and lack of space for contemplation and incorporation of information (I recommend the works of philosopher Byung-Chul Han on the matter) with yet unknown effects on our psyche and creative output. It takes days or weeks for one to sit and digest novel viewpoints; asking a machine to skip all that work for us is just another example of seeking instant gratification. I have no time to think, do it for me, so I can scroll to the next post already.

simianwords · 2026-03-08T13:41:41 1772977301

I don’t think you are wrong but isn’t it obvious to pick and choose cases where you might want to use LLMs vs doing the work? Seems obvious to me.

Sure if you want to read a novel, don’t ask an llm about it.

When you want to learn something quick then use LLMs. But you would know how much compression is going on. This is something we do routinely anyway. If I want to know something about taxes, I read the first google result and get the gist of it. But I’m still better off and didn’t require to take a full course.

kolinko · 2026-03-08T11:30:56 1772969456

For me it's the opposite - sure, for many outputs I don't need to think, but then I end up thinking on a higher level, and doing even more work.

An analogy would be - if GPS allows you to not worry about which turn to take, you can finally focus on where you want to get.

XenophileJKO · 2026-03-08T09:41:15 1772962875

I mean it can also depend on scale. I use hundreds of sub-agent instances to do analysis that I just would not be able to do in a reasonable timeframe. That is a TON of thinking done for me.

4ndrewl · 2026-03-08T10:24:26 1772965466

Is "not having to think" a good metric now?

rgoulter · 2026-03-08T10:42:54 1772966574

While "I don't have to think, I just get the LLM to do the task" is a bit careless (or a "hype" way of putting it)... I'd reckon it's always been true that you want to think about the stuff that matters and the other stuff to be done for minimal effort.

e.g. By using a cryptography library / algorithm someone else has written, I don't need to think about it (although someone has done the thinking, I hope!). Or by using a high level language, I don't need to think about how to write assembly / machine code.

Or with a tool like a spell-checker: since it checks the document, you don't have to worry about spelling mistakes.

What upsets is the imbalance between "tasks which previously required some thought/effort can now be done effortlessly". -- Stuff like "write out a document" used to signal effort had been put into it.

mr_mitm · 2026-03-08T10:38:36 1772966316

I think it could be. It doesn't have to be one or the other.

In my opinion it's entirely comparable to anything else that augments human capability. Is it a good thing that I can drive somewhere instead of walking? It can be. If driving 50 miles means I get there in an hour instead of two days, it can be a good thing, even though it could also be a bad thing if I replace all walking with driving. It just expands my horizon in the sense that I can now reach places I otherwise couldn't reasonably reach.

Why can't it be the same with thinking? If the machine does some thinking for me, it can be helpful. That doesn't mean I should delegate all thinking to the machine.

XenophileJKO · 2026-03-08T11:11:06 1772968266

I guess, how often do you pay someone to fix your car? Repair something in your house? Give you financial advice?

Those are all things many people outsource their thinking to other people with.

v3xro · 2026-03-08T11:19:44 1772968784

No, you outsource it because it's not your core competency. I think humans should be able to do anything and not narrowly specialise as narrow specialisation leads to tunnel vision. Sometimes you need to outsource to someone because of legal reasons (and rightly so, mostly because the complexities involved do require someone who is a professional in that area). Can some things be simplified? Of course they can, and there are many barriers that prevent such simplification. But it's absolutely insane to say - nah, we don't need to think at all, and something else can do all the work.

mr_mitm · 2026-03-08T16:41:56 1772988116

Nobody said "we don't need to think at all" though. The statement was "not having to think", or rephrased: "being able to choose how much to think or what to think about".

hectormalot · 2026-03-08T10:25:19 1772965519

There’s both a quality and quantity angle.

For some work, similar to the philosophy example of GP, LLMs can help with depth/quality. Is additive to your own thinking. -> quality approach

For other things. I take a quantity approach. Having 8 subagents research, implement, review, improve, review (etc) a feature in a non critical part of our code, or investigate a bug together with some traces. It’s displacing my own thinking, but that’s ok, it makes up for it with the speed and amount of work it can do. —> quantity approach.

It’s become mostly a matter of picking the right approach depending on the problem I’m trying to solve.

carlosjobim · 2026-03-08T14:17:49 1772979469

Why would you even read philosophy if you're then consulting a third party for interpretation? That is the definition of meaningless.

That is like listening to music and asking somebody if you liked a song.

therealdrag0 · 2026-03-09T01:58:37 1773021517

Philosophy is often considered an activity of engagement with others in thought and discussion. I don’t see why an LLM can’t play a role there.

tylervigen · 2026-03-08T14:30:11 1772980211

I would say it's more like enjoying a song so much that you choose to listen to a cover of that song.

mark_l_watson · 2026-03-07T13:39:57 1772890797

Vicky’s writeup is interesting, but to me the most interesting thing is Jeff Dean’s advice that sometimes doing a linear scan is the fastest approach (over any kind of indexing). This is basic advice, but modern developers might be pre-disposed to use index-based tools or data stores because the tech is now so good and ubiquitous.

xyzzy_plugh · 2026-03-07T16:48:44 1772902124

Modern developers are predisposed to reach for off the shelf solutions, full stop. They're afraid of, or perhaps allergic to, just reading and writing files.

If you can learn to get past this you can unlock a whole universe of problem solving.

mark_l_watson · 2026-03-07T13:25:49 1772889949

I think that I understand you. I started programming in the mid-1960s as a kid and now in my mid-70s I have been retired for two years (except for occasional small gigs for old friends). Nothing special about me but I have had the pleasure of working with or at least getting to know many of the famous people in neural networks and AI since the mid-1980s.

My current passion is pushing small LLMs as far as I can using tools and agentic frameworks. The latest Qwen 3.5 models have me over the moon. I still like to design and code myself but I also find it pleasurable to sometimes use Claude Code and Antigravity.

tqwhite · 2026-03-07T13:50:11 1772891411

I had dinner with Marvin Minsky once. Learned to program a Symbolics machine. We share a little history, I think. I've been interested in AI for the last forty years.

I decided that applications of AI were where I am going. I feel the pull of small LLMs. The idea of local is very appealing. But, at our age (I also started in the sixties), I've learned that too many irons in the fire means I get nothing done.

Congratulations on retaining your spirit. Many of my age-appropriate friends cannot comprehend the idea of working so hard for fun.