
mrlongroots · 2026-05-14T21:45:55 1778795155

> Yes, they don't realize it or lie to themselves because ~50% dropout.

I think there's some misinterpretation here. Not staying on in academia after PhD (common/modal) is not the same as not getting to complete a PhD (rare).

In CS/tech, those who exit academia after PhDs get paid $300K-$500K in the industry. I don't think there's any misleading going on.

caminante · 2026-05-16T00:22:13 1778890933

>is not the same as not getting to complete a PhD (rare)

BTW, your perspective is bizarre.

Not sure where you're getting the idea that PhD candidate attrition is rare. Maybe at MIT where only 20% don't finish (within 10 years -- which is generous), but these are already pre-screened superstars. Most other places converge around 50%.

As for salaries, the median salary for CS PhDs outside academia is $180k. That means a lot are lower and probably aren't working at big tech with full comp pushing them above $300k. [0]

[0] https://ncses.nsf.gov/pubs/nsf26312

fc417fc802 · 2026-05-14T21:48:56 1778795336

PhD programs have remarkably high attrition rates prior to graduation (i.e. dropout). I don't know that it's 50%, and obviously it varies by institution and field, but it's quite large.

caminante · 2026-05-14T22:33:25 1778798005

>In CS/tech, those who exit academia after PhDs get paid $300K-$500K

Yes, I'd like to see data on what percentile gets this and breaks even for lost wages from their PhD years. IMHO, it's not fair to generalize this outcome. I could be wrong.

mrlongroots · 2026-05-14T20:54:07 1778792047

As someone who graduated with a 7.5 year long PhD last month,

I feel like PhD stipends are not a major problem. Like I got $40K in a low CoL area, but accounting for tuition and overheads I cost my advisor closer to $150K/year.

Now why are tuition and overheads that high is a reasonable question and it ties into inefficiencies in broader American administrative processes, but I cost society and taxpayers $150K/year, and that I'm doing it for societal benefit is honestly only partly true. The first 6 years was just me building real skills and letting myself be frustrated, and maybe in the last 1.5 years I did things that justify the $1M bill and more.

Even if I did eventually do things that justified the $1M bill, I think most students don't. The larger value IMO lies in a workforce trained in the failures and frustrations of grad school. While I could rattle off plenty of limitations of academia/grad school, I'm not entirely convinced that me being shortchanged/underpaid was one of those things.

wyldberry · 2026-05-15T03:01:05 1778814065

It's great that you recognize that the last 1.5 years were the period you feel like you did things to justify that bill. However, much like juniors everywhere, you justify all of your pay because we are not paying you for your skill at that moment, but for who you will become.

Even more so for PhD work because the expectation is that after the training you will produce many things that make the cost of training you essentially negligible.

mrlongroots · 2026-04-22T14:43:58 1776869038

That training is compute-bound and inference is memory-bound is well-known, but I don't think Nvidia deployments typically specialize for one vs the other.

One reason is that most clouds/neoclouds don't own workloads, and want fungibility. Given that you're spending a lot on H200s and whatnot, it's good to also spend on the networking to make sure you can sell them to all kinds of customers. The Groq LPU in Vera Rubin is an inference-specific accelerator, and Cerebras is also inference-optimized, so specialization is starting to happen.

mrlongroots · 2026-04-03T14:44:29 1775227469

MapReduce is nice but it doesn't, by itself, help you reason about pushdowns, for one. Parquet, for example, can push down select/project/filter, and that's lost if you have MapReduce. And a reduce is just a shuffle + map, not very different from a distributed join. MapReduce as an escape hatch over what is fundamentally still relational algebra may be a good intuition.
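
To make the "reduce is just a shuffle + map" point concrete, here is a minimal single-process sketch in Python. The function names (`shuffle`, `reduce_by_key`) and the toy records are made up for illustration and don't come from any particular framework.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, as a shuffle stage would across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(pairs, fn):
    """A reduce expressed as shuffle + map: group by key, then map fn over each group."""
    return {key: fn(values) for key, values in shuffle(pairs).items()}

# Hypothetical records: (user, bytes_read)
records = [("a", 10), ("b", 5), ("a", 7), ("c", 1), ("b", 3)]
totals = reduce_by_key(records, sum)
print(totals)
```

The grouping step is the same repartition-by-key that a distributed hash join performs on its join key, which is why the two end up looking so similar.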

mrlongroots · 2026-04-03T14:40:43 1775227243

Algebras are also nice for implementations. If you can decompose a domain into a few algebraic primitives you can write nice SIMD/CUDA kernels for those primitives.

To your point, I wonder if the 73 distinct transforms were just different defaults/usability wrappers over these. And you may also get into situations where kernels can be fused together or other batching constraints enable optimizations that nice algebraic primitives don't capture. But that's just systems---theory is useful in helping rethink API bloats and keeping us all honest.

hermitcrab · 2026-04-03T15:38:50 1775230730

They are effectively high-level wrappers over the most primitive operations. High enough level that they can be used from a GUI, rather than code.

It is a balance. Too few transforms and they become too low level for my users. Too many and you struggle to find the transform you want.

jimbokun · 2026-04-03T16:41:42 1775234502

You don’t have to limit the transforms you offer users to just the core ones. But for your own sanity you can implement the non-core ones in terms of the core ones.
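
A rough sketch of that idea in Python, assuming a made-up set of three core transforms (`filter_rows`, `map_column`, `select`) over lists of dicts. All names here, including the derived `remove_blank` and `uppercase_column`, are hypothetical and not taken from any real product.

```python
# Core transforms: the only ones that touch rows directly.
def filter_rows(rows, predicate):
    return [row for row in rows if predicate(row)]

def map_column(rows, column, fn):
    return [{**row, column: fn(row[column])} for row in rows]

def select(rows, columns):
    return [{c: row[c] for c in columns} for row in rows]

# Non-core transforms: user-facing conveniences built only from the core ones.
def remove_blank(rows, column):
    return filter_rows(rows, lambda row: row[column] != "")

def uppercase_column(rows, column):
    return map_column(rows, column, str.upper)

rows = [{"name": "ada", "city": "london"}, {"name": "", "city": "paris"}]
result = select(uppercase_column(remove_blank(rows, "name"), "name"), ["name"])
print(result)
```

Only the three core functions would need optimized implementations; the wrappers inherit whatever those get.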

mrlongroots · 2026-02-06T20:08:50 1770408530

Yes, GPT5-series thinking models are extremely pedantic and tedious. Any conversation with them is derailed because they start nitpicking something random.

But Codex/5.2 was substantially more effective than Claude at debugging complex C++ bugs until around Fall, when I was writing a lot more code.

I find Gemini 3 useless. It has regressed on hallucinations from Gemini 2.5, to the point where its output is no better than a random token stream despite all its benchmark outperformance. I would use Gemini 2.5 to help write papers and all, but I can't seem to use Gemini 3 for anything. Gemini CLI is also very non-compliant and crazy.

mrlongroots · 2026-01-23T16:43:09 1769186589

While Arrow is amazing, it is only the C Data Interface that can be FFI'ed, which is pretty low level. If you have something higher-level like a table or a vector of recordbatches, you have to write quite a bit of FFI glue yourself. It is still performant because it's a tiny amount of metadata, but it can still be a bit tedious.

And the reason is ABI compatibility. Reasoning about ABI compatibility across different C++ versions and optimization levels and architectures can be a nightmare, let alone different programming languages.

The reason it works at all for Arrow is that the leaf levels of the data model are large contiguous columnar arrays, so reconstructing the higher layers still gets you a lot of value. The other domains where it works are tensors/DLPack and scientific arrays (Zarr etc). For arbitrary struct layouts across languages/architectures/versions, serdes is way more reliable than a universal ABI.

mrlongroots · 2025-12-09T18:19:11 1765304351

Hyperscalers do not need to achieve parity with Nvidia. There's a (let's say) 50% headroom in terms of profit margins, and plenty of headroom in terms of the complexity custom chip efforts need to implement: they don't need the complexity or generality of Nvidia's chips. If a simple architecture allows them to do inference at 50% of the TCO and 1/5th the complexity and reduce their Nvidia bill by 70% that's a solid win. I'm being fast and loose with numbers and Trainium clearly seems to have ambitions beyond inference, but given the hundreds of billions each cloud vendor is investing in the AI buildout, a couple billion on IP that you will own afterwards is a no brainer. Nvidia has good products and a solid head start but they're not unassailable or anything.

mrlongroots · 2025-12-08T04:53:04 1765169584

Yeah, unfortunately no amount of manoeuvring is a substitute for a kill chain where a distributed web of sensors and relays and weapon carriers can result in an AAM being dispatched from any direction at lightspeed.

mrlongroots · 2025-11-14T03:25:54 1763090754

Yep I think the value of the experiment is not clear.

You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth is 1GB/s from S3. CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory for a large dataset with multiple stages, and use cluster-internal network bandwidth to do shuffling instead of storage.

Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
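
As a back-of-the-envelope illustration of that rule of thumb, the sketch below applies the 20X/20X/10X ratios to the 1GB/s S3 figure and computes how long one pass over a dataset would take at each tier. The 1 TB dataset size is an assumption picked for illustration, and the bandwidths are what the crude rule implies, not measurements.

```python
STORAGE_GBPS = 1.0  # GB/s from S3, the figure quoted above

# Crude multipliers from the rule of thumb, each relative to the previous tier.
tiers = [("storage", 1), ("network", 20), ("cpu_memory", 20), ("gpu_memory", 10)]

DATASET_GB = 1000  # assumed 1 TB dataset, for illustration only

bandwidth = {}
current = STORAGE_GBPS
for name, multiplier in tiers:
    current *= multiplier
    bandwidth[name] = current

# Time for a single full pass over the dataset at each tier's bandwidth.
scan_seconds = {name: DATASET_GB / bw for name, bw in bandwidth.items()}

for name in bandwidth:
    print(f"{name:>10}: {bandwidth[name]:>6.0f} GB/s, {scan_seconds[name]:.2f} s per pass")
```

The implied 400 GB/s for CPU memory overshoots the 100-200GB/s quoted above, which is about the slack you'd expect from a crude rule. The point is the gap between the first and last rows: a multi-stage job that re-reads from storage pays the slowest rate on every stage.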

justincormack · 2025-11-14T08:41:40 1763109700

Network bandwidth is not 20x storage any more. An SSD is around 10GB/s now, so similar to 100Gb ethernet.

mrlongroots · 2025-11-14T16:28:47 1763137727

I think I'm talking about cluster-scale network bisection bandwidth vs attached storage bandwidth. With replication/erasure coding overhead and the economics, the order of magnitude difference still prevails.

I think your point is a good one in that it is more economics than systems physics. We size clusters to have more compute/network than storage because it is the design point that maximizes overall utility.

I think it also raises an interesting question in that let's say we get to a point where the disparity really no longer holds: that would justify a complete rethinking of many Spark-like applications that are designed to exploit this asymmetry.

wtallis · 2025-11-14T16:25:49 1763137549

And that's for one SSD. If you're running on a server rather than a laptop, aggregate storage bandwidth will almost certainly be higher than any single network link.

mrlongroots · 2025-11-14T16:30:54 1763137854

The appropriate comparison point for aggregate cluster storage bandwidth would be its bisection bandwidth.

(I do HPC, IIRC ANL Aurora is < 1PB/s DAOS and 20 PB/s bisection).