
zackangelo · 2026-05-26T16:58:24 1779814704

If you're in SF and weighing this decision, it's easy to get tilted in the buy direction because the rental stock is so horrific. Landlords have very little incentive to update properties or provide basic amenities that people take for granted in other major cities (good luck getting a washer/dryer).

zeroonetwothree · 2026-05-26T18:41:54 1779820914

I wonder if there’s any reason SF has that problem yet almost no other cities do.

zackangelo · 2026-05-20T15:09:32 1779289772

With the 3.5 release, the Plus model was just a rebrand of the open-weight 397B model. But I suspect that will change going forward. They haven’t released the weights for 3.6, but they did make it available through a few US providers.

zackangelo · 2026-05-19T17:29:14 1779211754

absolutely not, take Kimi K2.6 for a spin

zackangelo · 2026-04-29T17:56:11 1777485371

Isn't Kimi K2.6 natively INT4?

simjnd · 2026-04-29T18:27:39 1777487259

I don't think any models are natively INT4? I don't see the point of nerfing the model out of the box.

zozbot234 · 2026-04-29T18:39:00 1777487940

It's not nerfed; it's natively trained at that quantization, a.k.a. quantization-aware training (QAT).

pbgcp2026 · 2026-04-30T06:13:13 1777529593

QAT typically uses BF16/FP32 during the training process to simulate lower precision.
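A minimal sketch of the "simulate lower precision" idea: weights stay in full precision, are rounded to INT4 levels and immediately dequantized for the forward pass, and the update is applied to the float copy (a straight-through estimator). The symmetric per-tensor scale, the [-7, 7] range, and the helper names are assumptions for illustration, not any particular model's training recipe.

```python
def fake_quant_int4(weights):
    """Quantize to symmetric INT4 levels, then dequantize back to float."""
    qmax = 7  # assumption: symmetric range [-7, 7]; the -8 code is left unused
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    dequantized = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # nearest INT4 level, clamped
        dequantized.append(q * scale)                # float value the forward pass sees
    return dequantized, scale


def qat_step(master_weights, grad_fn, lr=0.1):
    """One hypothetical QAT update using a straight-through estimator."""
    w_q, _ = fake_quant_int4(master_weights)
    grads = grad_fn(w_q)  # loss gradient evaluated at the quantized weights
    # Straight-through: treat rounding as identity in the backward pass and
    # apply the gradient to the full-precision master weights.
    return [w - lr * g for w, g in zip(master_weights, grads)]


if __name__ == "__main__":
    w = [3.5, -1.0, 0.3, 0.0]
    print(fake_quant_int4(w))  # ([3.5, -1.0, 0.5, 0.0], 0.5)
    loss_grad = lambda ws: [2 * (x - 1.0) for x in ws]  # gradient of sum((x - 1)^2)
    print(qat_step(w, loss_grad))
```

The point is that the arithmetic runs in BF16/FP32 while the model learns weights that survive rounding to 4-bit levels.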

EntityDeletr · 2026-04-30T16:36:41 1777567001

The only model I have seen like that is GPT OSS, natively quantized to MXFP4.
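For context on the format: in the OCP Microscaling (MX) spec, MXFP4 stores blocks of 32 FP4 (E2M1) elements that share one power-of-two (E8M0) scale. A minimal sketch of quantizing one block, assuming the spec's scale rule of floor(log2(max|v|)) minus the largest E2M1 exponent (2), with each scaled element rounded to the nearest E2M1 value and saturated at ±6. Tie-breaking and the E8M0 exponent-range clamp are simplified, so this is illustrative, not bit-exact.

```python
import math

# Non-negative values representable by an FP4 E2M1 element (sign stored separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2     # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32   # elements that share one scale in MXFP4


def round_to_e2m1(x):
    """Nearest E2M1 value, saturating at +/-6 (ties go to the smaller magnitude here)."""
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return math.copysign(mag, x)


def quantize_block(block):
    """Return (shared exponent, FP4 element values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e2m1(v / scale) for v in block]


def dequantize_block(shared_exp, elements):
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


def quantize_mxfp4(values):
    """Quantize a flat list in independent 32-element blocks."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


if __name__ == "__main__":
    exp, elems = quantize_block([12.0, -3.0, 1.0, 0.6])
    print(exp, elems, dequantize_block(exp, elems))
    # 1 [6.0, -1.5, 0.5, 0.5] [12.0, -3.0, 1.0, 1.0]
```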

zackangelo · 2026-04-29T12:54:39 1777467279

I don’t think this is true across Blizzard. Overwatch is the best it’s ever been.

JCTheDenthog · 2026-04-29T13:38:16 1777469896

The Steam userbase would appear to disagree: recent reviews are mostly negative (and user reviews for Overwatch have hovered between mixed and negative for years now). This doesn't appear to be review bombing by some specific subset of players, either; the language breakdown shows reviews ranging from mixed to negative in all major language groups (English, Russian, Chinese, etc.).

zackangelo · 2026-04-22T23:35:02 1776900902

I give them a try about twice a year. I write a lot of Rust, which should be squarely in their wheelhouse.

This last time I was pleasantly surprised to find they had mostly fixed their SSH remote editing support. But then it started truncating inline rustc error messages, and I couldn’t figure out how to easily view the whole thing. When you’re just trying to get something done, little things like this add up quickly. Punted back to Cursor for now.

donmcronald · 2026-04-23T00:19:09 1776903549

I don't like the way remote editing works with plugins. IIRC, the remote agent pulls the plugins from the connecting client. I get why it's done like that, but I'd way rather have it go the opposite direction.

I want a setup where I can have an immutable devcontainer with local copies of everything I need to develop 100% offline: dependencies, tools, etc. Having my local editor pull plugins from a devcontainer for the project makes more sense to me.

I didn't dig in too much. Maybe there's a way to make it work somehow.

zackangelo · 2026-04-16T15:00:42 1776351642

They are, but the IDE needs to be integrated with them.

Qwen specifically calls out FIM (“fill in the middle”) support on the model card, and you can see it getting confused and emitting the control tokens in the example here.
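A minimal sketch of what that integration involves: the editor wraps the text around the cursor in the model's FIM control tokens and strips any control tokens from the completion before inserting it. The token strings are the ones documented for Qwen2.5-Coder; other Qwen releases may differ, so check the specific model card. The helper names are hypothetical.

```python
# Token strings as documented for Qwen2.5-Coder (assumption for other Qwen models).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"
CONTROL_TOKENS = (FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE, "<|endoftext|>")


def build_fim_prompt(before_cursor, after_cursor):
    """Prefix-suffix-middle layout: the model generates the text that goes at the cursor."""
    return f"{FIM_PREFIX}{before_cursor}{FIM_SUFFIX}{after_cursor}{FIM_MIDDLE}"


def clean_completion(raw):
    """Cut the completion at the first control token so none leak into the buffer."""
    cut = len(raw)
    for tok in CONTROL_TOKENS:
        idx = raw.find(tok)
        if idx != -1:
            cut = min(cut, idx)
    return raw[:cut]


if __name__ == "__main__":
    print(build_fim_prompt("def add(a, b):\n    return ", "\n"))
    print(clean_completion("a + b<|endoftext|>"))  # a + b
```

An editor that skips either step is what produces visible control tokens in the output.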

sosodev · 2026-04-16T15:14:20 1776352460

Oh, that’s interesting. Thanks for the correction. I didn’t know such heavily post-trained models could still do good ol'-fashioned autocomplete.

zackangelo · 2026-04-16T14:57:47 1776351467

17B per token. So when you’re generating a single stream of text (“decoding”), 17B parameters are active.

If you’re decoding multiple streams, it will be 17B per stream, but the total number of distinct parameters touched is less than 17B × the number of streams, because some tokens are routed to the same experts.

When the model is ingesting the prompt (“prefilling”), it’s looking at many tokens at once, so the number of distinct active parameters will be larger, since different tokens are routed to different experts.
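To make the per-token vs. per-batch distinction concrete, a toy calculation: count the distinct (layer, expert) pairs a group of tokens touches and add the always-active shared weights once. The configuration (2 MoE layers, 8 experts per layer, top-2 routing, the parameter sizes) and the routing choices are made up for illustration and do not correspond to the 17B model discussed.

```python
def active_params(routes, shared_params, params_per_expert):
    """Distinct parameters touched by a group of tokens.

    routes[t][l] lists the experts the router picked for token t in MoE layer l.
    shared_params covers weights every token uses (attention, embeddings, ...).
    """
    touched = {
        (layer, expert)
        for token in routes
        for layer, experts in enumerate(token)
        for expert in experts
    }
    return shared_params + len(touched) * params_per_expert


# Hypothetical toy config: 2 MoE layers, 8 experts each, top-2 routing.
SHARED, PER_EXPERT = 1_000, 100

one_token = [[[0, 1], [2, 3]]]
two_tokens = [[[0, 1], [2, 3]],   # same routing as above
              [[1, 5], [3, 4]]]   # reuses expert 1 in layer 0 and expert 3 in layer 1

print(active_params(one_token, SHARED, PER_EXPERT))   # 1400: the "per token" figure
print(active_params(two_tokens, SHARED, PER_EXPERT))  # 1600, vs. 1800 with no overlap

# A long prompt during prefill can end up touching every expert in every layer.
prefill = [[[t % 8, (t + 1) % 8], [(t + 3) % 8, (t + 4) % 8]] for t in range(64)]
print(active_params(prefill, SHARED, PER_EXPERT))     # 2600 = 1000 + 2 * 8 * 100
```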

zackangelo · 2026-01-10T23:56:27 1768089387

This uses Nvidia’s CUDA snapshot API under the hood, but you have to pair it with a host side snapshot as well. Modal uses gVisor for this, which is notoriously high overhead.

Does anyone know of a more efficient alternative if you’re running a trusted container?

luiscape · 2026-01-11T17:07:12 1768151232

Post author here: there are other projects that create a proxy for CUDA calls and use the log of CUDA operations to checkpoint/restore or live-migrate tasks. We haven’t used them. I don’t believe they are very popular or used outside specific orgs.

This is the only API available for snapshotting NVIDIA GPU memory, afaik.

As for needing to combine it with a host memory snapshot step, this is required because CUDA sessions need to be mapped to a host process, so you need to snapshot both things in order for the program to be restored correctly.

CRIU is another project that uses the same technique (CUDA snapshot + host memory snapshot). Unlike CRIU, our snapshots work at the function level, so we’re able to take snapshots after functions have been initialized (including GPU memory), making Modal cold boots fast. With CRIU, one would have to implement this entire process oneself.

zackangelo · 2025-12-12T23:05:09 1765580709

Sparks are built for this and actually have ConnectX-7 NICs built in! You just need to get the SFPs for them. This means you can natively cluster them at 200 Gbps.

wtallis · 2025-12-12T23:18:33 1765581513

That doesn't answer the question, which was how to get a high-speed interconnect between a Mac and a DGX Spark. The most likely solution would be a Thunderbolt PCIe enclosure with a 100Gb+ NIC, plus passive DAC cables. The tricky part would be macOS drivers for said NIC.

zackangelo · 2025-12-12T23:26:39 1765581999

You’re right, I misunderstood.

I’m not sure it would be of much utility, because this would presumably be for tensor-parallel workloads. In that case you want the ranks in your cluster to be uniform, or else everything will be forced to run at the speed of the slowest rank.

You could run pipeline parallel, but I’m not sure it’d be much better than what we already have.
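A back-of-the-envelope model of the slowest-rank point: with tensor parallelism every step ends in a collective, so step time is the max over ranks; with pipeline parallelism, steady-state throughput is set by the slowest stage, while one microbatch's latency is the sum of the stages. Communication time and uneven work splitting are ignored, and the timings are hypothetical, not measurements of a Mac or a Spark.

```python
def tensor_parallel_step_time(rank_times):
    """Every rank must finish its shard before the collective completes."""
    return max(rank_times)


def pipeline_throughput(stage_times):
    """Steady-state microbatches per second once the pipeline is full."""
    return 1.0 / max(stage_times)


def pipeline_latency(stage_times):
    """Time for a single microbatch to pass through every stage."""
    return sum(stage_times)


if __name__ == "__main__":
    fast, slow = 0.010, 0.025  # hypothetical seconds of compute per step/stage
    print(tensor_parallel_step_time([fast, slow]))  # 0.025: the fast rank idles 15 ms
    print(pipeline_throughput([fast, slow]))        # about 40 microbatches/s
    print(pipeline_latency([fast, slow]))           # about 0.035 s
```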

storus · 2025-12-13T13:54:55 1765634095

It was about this use case:

https://blog.exolabs.net/nvidia-dgx-spark/