Have you considered rotating the layout? I always thought a CJK programming language written vertically would be very ergonomic. Instead of scrolling vertically, the program would flow right-to-left or left-to-right. I guess you'd probably want to rotate the bracket/paren glyphs, which is a bit less trivial to do.
Presumably the idea is that you put the relevant parts of the list in your thesis. You need to convince your examiner that you understand the background to the original research you did, and a solid reference list (with supporting text in the introductory/background section of your thesis) is part of doing that.
Personally I did the references at the end and didn't feel like I suffered from that decision, but the key references in my particular area were a relatively small and well-known set.
Hmm, yeah. I mean you often see huge reference lists, which always makes me feel like the person can't possibly be well acquainted with everything that's being referenced. So who are you really fooling? It seems very performative, though I guess I understand the motivation.
> What’s the deal with “hallucinations”? The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data.
Extremely naive question... but could LLM output be tagged with some kind of confidence score? Like if I'm asking an LLM some question, does it have an internal metric for how confident it is in its output? LLM outputs rarely seem to be of the form "I'm not really sure, but maybe it's XXX" - but I always felt this is baked into the model somehow.
The model could report the confidence of its output distribution, but it isn't necessarily calibrated (that is, even if it tells you that it's 70% confident, it doesn't mean that it is right 70% of the time). Famously, pre-trained base models are calibrated, but they stop being calibrated when they are post-trained to be instruction-following chatbots [1].
Edit: There is also some other work that points out that chat models might not be calibrated at the token level, but might be calibrated at the concept level [2]. Which means that if you sample many answers and group them by semantic similarity, that grouping is also calibrated. The problem is that generating many answers and grouping them is more costly.
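For concreteness, here's a minimal sketch (plain Python, names are mine) of what "reporting the confidence of the output distribution" means at the token level: softmax the logits and read off the probability mass on the argmax token. The calibration caveat above is exactly that this number need not match empirical accuracy.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_confidence(logits):
    # "Confidence" here is just the probability of the most likely token:
    # a statement about the output distribution, not about truth, and
    # (per the calibration work cited above) not necessarily calibrated
    # after instruction tuning.
    return max(softmax(logits))
```

Nothing here tells you whether the token is *correct*; it only tells you how peaked the distribution was at that step.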
In absolute terms, sure, but the token stream's confidence changes as it's coming out, right? Consumer LLMs typically have a lot of window dressing. My sense is this encourages the model to stay on-topic, and it's mostly "high confidence" fluff. As it's spewing text/tokens back at you, maybe when it starts hallucinating you'd expect a sudden dip in the confidence?
You could color code the output tokens so you can see abrupt changes
It seems kind of obvious, so I'm guessing people have tried this
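If you have per-token probabilities (many local inference APIs can return logprobs), the color-coding idea is only a few lines. A hypothetical sketch using ANSI escape codes, with entirely arbitrary thresholds:

```python
def colorize(tokens_with_probs):
    # tokens_with_probs: list of (token_text, probability) pairs.
    # Map each token's probability to an ANSI color: green for
    # high-probability tokens, yellow for middling, red for low.
    # The 0.8 / 0.4 cutoffs are arbitrary -- tune to taste.
    out = []
    for token, p in tokens_with_probs:
        if p > 0.8:
            code = 32  # green
        elif p > 0.4:
            code = 33  # yellow
        else:
            code = 31  # red
        out.append(f"\x1b[{code}m{token}\x1b[0m")
    return "".join(out)
```

As the sibling comment notes, a single low-probability token isn't the same thing as a hallucination, so treat the colors as raw signal, not an error detector.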
Look up “dataloom”. People have been playing with this idea for a while. It doesn’t really help with spotting errors because they aren’t due to a single token (unless the answer is exactly one token) and often you need to reason across low probability tokens to eventually reach the right answer.
Having a confidence score isn't as useful as it seems unless you (the user) know a lot about the contents of the training set.
Think of traditional statistics. Suppose I said "80% of those sampled preferred apples to oranges, and my 95% confidence interval is within +/- 2% of that" but then I didn't tell you anything about how I collected the sample. Maybe I was talking to people at an apple pie festival? Who knows! Without more information on the sampling method, it's hard to make any kind of useful claim about a population.
This is why I remain so pessimistic about LLMs as a source of knowledge. Imagine you had a person who was raised from birth in a completely isolated lab environment and taught only how to read books, including the dictionary. They would know how all the words in those books relate to each other but know nothing of how that relates to the world. They could read the line "the killer drew his gun and aimed it at the victim" but what would they really know of it if they'd never seen a gun?
I think your last point raises the following question: how would you change your answer if you know they read all about guns and death and how one causes the other? What if they'd seen pictures of guns? And pictures of victims of guns annotated as such? What if they'd seen videos of people being shot by guns?
I mean, I sort of understand what you're trying to say, but in fact a great deal of the knowledge we have about the world we get second-hand.
There are plenty of people who've never held a gun, or had a gun aimed at them, and... granted, you could argue they probably wouldn't read that line the same way as people who have, but that doesn't mean that the average Joe who's never been around a gun can't enjoy media that features guns.
Same thing about lots of things. For instance it's not hard for me to think of animals I've never seen with my own eyes. A koala for instance. But I've seen pictures. I assume they exist. I can tell you something about their diet. Does that mean I'm no better than an LLM when it comes to koala knowledge? Probably!
It’s more complicated to think about, but it’s still the same result. Think about the structure of a dictionary: all of the words are defined in terms of other words in the dictionary, but if you’ve never experienced reality as an embodied person then none of those words mean anything to you. They’re as meaningless as some randomly generated graph with a million vertices and a randomly chosen set of edges according to some edge distribution that matches what we might see in an English dictionary.
Bringing pictures into the mix still doesn’t add anything, because the pictures aren’t any more connected to real world experiences. Flooding a bunch of images into the mind of someone who was blind from birth (even if you connect the images to words) isn’t going to make any sense to them, so we shouldn’t expect the LLM to do any better.
Think about the experience of a growing baby, toddler, and child. This person is not having a bunch of training data blasted at them. They’re gradually learning about the world in an interactive, multi-sensory and multi-manipulative manner. The true understanding of words and concepts comes from integrating all of their senses with their own manipulations as well as feedback from their parents.
Children also are not blank slates, as is popularly claimed, but come equipped with built-in brain structures for vision, including facial recognition, voice recognition (the ability to recognize mom’s voice within a day or two of birth), universal grammar, and a program for learning motor coordination through sensory feedback.
Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.
[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]
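To make that concrete, here's a toy sketch of the sampling step itself (temperature included), assuming you already have the model's logits for the next token:

```python
import math
import random

def sample_token(logits, temperature=1.0):
    # Scale logits by 1/temperature, softmax into a probability
    # distribution, then sample an index from it.
    # temperature -> 0 approaches greedy argmax decoding;
    # temperature > 1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]
```

The per-token probability the parent comments discuss is exactly the `probs[i]` of whichever index gets drawn here; nothing in this step knows or cares whether the resulting text is true.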
It’s often very difficult (intractable) to come up with a probability distribution of an estimator, even when the probability distribution of the data is known.
Basically, you’d need a lot more computing power to come up with a distribution of the output of an LLM than to come up with a single answer.
In microgpt, there's no alignment. It's all pretraining (learning to predict the next token). But for production systems, models go through post-training, often with some sort of reinforcement learning which modifies the model so that it produces a different probability distribution over output tokens.
But the model "shape" and computation graph itself doesn't change as a result of post-training. All that changes is the weights in the matrices.
The LLM has an internal "confidence score" but that has NOTHING to do with how correct the answer is, only with how often the same words came together in training data.
E.g. getting two r's in "strawberry" could very well have a very high "confidence score", while a random but rare correct fact might well have a very low one.
In short: LLMs have no concept of truth, nor any drive to produce it.
Still, it might be interesting information to have access to, as someone running the model? Normally we are reading the output trying to build an intuition for the kinds of patterns it outputs when it's hallucinating vs creating something that happens to align with reality. Adding in this could just help with that even when it isn't always correlated to reality itself.
Uh, to explain what? You probably read something into what I said while I was being very literal.
If you train an LLM on mostly false statements, it will generate both known and novel falsehoods. Same for truth.
An LLM has no intrinsic concept of true or false; everything is a function of the training set. It just generates statements similar to what it has seen, and higher-dimensional analogies of those.
Reasoning allows a model to produce statements that are more likely to be true based on statements that are known to be true. You'd need to structure your "falsehood training data" in a specific way to allow an LLM to generalize as well as it does with the regular data (instead of memorizing noise). And then you'll get a reasoning model that remembers false premises.
You generate your text based on a "stochastic parrot" hypothesis with no post-validation it seems.
Really, how hard is it to follow HN guidelines and:
a) not imagine straw-man arguments and not imagine more (or less) than what was said
b) refrain from snarky and false ad hominems
Nothing you said conflicts with what I said, and it again shows a fundamental misunderstanding.
Reasoning is (mostly) part of the post-training dataset. If you add a large majority of false (i.e. paradoxical, irrational, etc.) reasoning traces to those, you'll get a model that successfully replicates the false reasoning of humans. If you mix it in with true reasoning traces, I imagine you'll get infinite-loop behaviour as the reasoning trace oscillates between the true and the false.
The original premise that truth is purely a function of the training dataset still stands... I'm not even sure what people are arguing here, as that seems quite trivially obvious?
Ah, sorry. I didn't recognize "all the high-level capabilities of an LLM come from the training data (presumably unlike humans, given the context of this thread)" in your wording. This is probably true. LLM structure probably has no inherent inductive bias that would amount to truth seeking. If you want to get a useless LLM, you can do it. OK, no disagreement here.
The overwhelming majority of true statements isn't in the training corpus, due to a combinatorial explosion. What does it mean, then, that they are more likely to occur there?
There is a paper that proposed data compression as a way to judge the ability of an LLM to "understand" things correctly, training on older texts and trying to predict more recent articles:
On a high level, do I understand correctly that SIMD is close to how the hardware works, while a vector processor is more of an abstraction? The "strip mining" part looks like this translation to something SIMD-like. It seems like a good abstraction layer, but there is an implicit compilation step, right? (making the "assembly" more easily run on different actual hardware)
> On a high level, do I understand correctly that SIMD is close to how the hardware works, while Vector Processor is more of an abstraction?
Not quite. It still is the same “process whatever number of items you can in parallel, decrease count by that, repeat if necessary“ loop.
RISC-V decided to move the “decrease count by that, repeat if necessary” part into hardware, making the entire phrase “how the hardware works”.
Makes for shorter and nicer assembly. SIMD without it first has to query the CPU to find out how much parallelization it can handle (once) and do the “decrease count by that, repeat if necessary” part on the main CPU.
RVV still very much requires you to write a manual code/assembly loop doing the "compute how many elements can be handled, decrease count by that, repeat if necessary" thing. All it does is make it slightly fewer instructions to do so (and also allows handling a loop's tail in the same loop while at it).
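In pseudocode terms (Python here purely for illustration), both approaches share the same strip-mining shape; the difference the comments above are debating is only whether the `min(remaining, vlmax)` step is a dedicated hardware instruction (`vsetvli` on RVV) or something you compute yourself after querying the CPU:

```python
def strip_mine(data, vlmax, op):
    # Conceptual sketch of the strip-mining loop. vlmax stands in for
    # the hardware vector length; on RVV, vsetvli returns
    # min(remaining, vlmax) each iteration, so the tail is handled by
    # the same loop body instead of a separate scalar cleanup loop.
    out = []
    i = 0
    n = len(data)
    while n > 0:
        vl = min(n, vlmax)                          # what vsetvli computes
        out.extend(op(x) for x in data[i:i + vl])   # one "vector" operation
        i += vl
        n -= vl                                     # decrease count, repeat
    return out
```

Note how a length of 10 with vlmax 4 runs three iterations (4, 4, 2) with no special tail code, which is the "nicer assembly" point above.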
Yeah, except you don't need to rewrite that code every time a new AVX drops, and also don't need to bother to figure out what to do on older CPUs.
IIRC libc for x64 has several implementations of memcpy/memmove/strlen/etc. for different SSE/AVX extensions, which all get compiled in and shipped to your system; when libc is loaded for the first time, it figures out the latest extension the CPU it's running on actually supports and then patches its exports to point to the fastest working implementations.
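The shape of that dispatch (resolve once at load time, then all later calls go straight to the chosen implementation, which is roughly what glibc's ifunc mechanism does) can be sketched like this; the function names and extension strings are made up for illustration:

```python
def detect_best_extension(supported):
    # Hypothetical capability detection: pick the newest extension the
    # CPU reports, in preference order. Real libc uses CPUID plus ifunc
    # resolvers patched once at load time; this is just the shape of
    # that decision.
    for ext in ("avx512", "avx2", "sse2"):
        if ext in supported:
            return ext
    return "scalar"

# Stand-in implementations: in real life these compute the same result
# at different speeds; here they're all trivially len().
IMPLEMENTATIONS = {
    "avx512": lambda s: len(s),
    "avx2":   lambda s: len(s),
    "sse2":   lambda s: len(s),
    "scalar": lambda s: len(s),
}

# Resolved once, like an ifunc: later calls skip the detection entirely.
strlen = IMPLEMENTATIONS[detect_best_extension({"sse2", "avx2"})]
```

The cost is exactly what the comments describe: every implementation ships in the binary, and only one ever runs on a given machine.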
You don't need to write a new loop every time a new vector size drops, but over time you'll still get more and more cases of wanting to write multiple copies of loops to take advantage of new instructions; there are already a good bunch of extensions of RVV (e.g. Zvbb has a good couple that are encounterable in general-purpose code), and many more to come (e.g. if we ever get vrgathers that don't scale quadratically with LMUL, some mask ops, and who knows what else will be understood as obviously-good-to-have in the future).
This kinda (though admittedly not entirely) balances out the x86 problem - sure, you have to write a new loop to take advantage of wider vector registers, but you often want to do that anyway - on SSE→AVX(2) you get to take advantage of non-destructive ops, all inline loads being unaligned, and a couple new nice instrs; on AVX2→AVX512 you get a ton of masking stuff, non-awful blends, among others.
RVV gets an advantage here largely due to just simply being a newer ISA, at a time where it is actually reasonably possible for even baseline hardware to support expensive compute instrs, complex shuffles, all unaligned mem ops (..though, actually, with RISC-V/RVV not mandating unaligned support (and allowing it to be extremely-slow even when supported) this is another thing you may want to write multiple loops for), and whatnot; whereas x86 SSE2 had to work on whatever could exist 20 years ago, and as such made respective compromises.
In some edge-cases the x86 approach can even be better - if you have some code that benefits from having different versions depending on hardware vector size (e.g. needs to use vrgather, or processes some fixed-size data that'd be really bad to write in a scalable way), on RVV you may end up needing to write a loop for each combination of VLEN and extension-set (i.e. a quadratic number of cases), whereas on x86 you only need to have a version of the loop for each desired extension-set.
I mean, "move into hardware" is effectively more of a microcode translation/compilation step, right? The way things are in the end actually executed on the silicon is not fundamentally rearchitected, right?
I'm going to try to read through the full document carefully later :)) Likely it's answered in there
This looks very pretty! I like it and it's very minimalist. This is pretty much entirely outside of my area of expertise, but do you have some library for leaflet that you're using from Clojure? Or just JS interop calls?
It's cool to see sleek projects like this. I made an application that needed to make heatmaps, but I just made a grid of colored squares, cropped some GeoJSON contours, and ended up generating SVGs... A bit goofy reimplementing a mapping library, but I needed to do some heavy math, so this way it was all JVM code.
Using leaflet from Clojure is entirely JS interop calls. You can see an example in my other mapping project here (https://github.com/ahmed-machine/mapbh/blob/master/src/app/p...). I'll add that leaflet 2, while still in alpha, is much nicer to use from CLJS as it replaces all factory methods with constructor calls: (e.g. L.marker(latlng) -> new Marker(latlng)). I've been slowly moving my newer mapping projects over to the new version.
Your project sounds really cool, I'd love to read that code. The implementation in this project largely utilises Leaflet's GeoJSON layers (https://leafletjs.com/examples/geojson/) which does render out to SVGs (there's an optional canvas renderer, too). One of the trickier parts was figuring out how to layer each isochrone band so that those closest to the point (i.e. 15 minute band) were painted on top of the bands further away (https://www.geeksforgeeks.org/dsa/painters-algorithm-in-comp...). That and pre-computing the distances per NYC intersection across the tri-state area which required a lot of messing around with OpenTripPlanner configuration files, GTFS data, and parallelising the code to finish in a reasonable time span (few days).
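The layering trick amounts to painting bands farthest-first, so the nearest band (the 15-minute one) is drawn last and ends up on top, per the painter's algorithm. A tiny sketch of that ordering (my own names, not the project's actual code):

```python
def paint_order(bands):
    # bands: list of (minutes, geometry) isochrone bands.
    # Sort by travel time descending so the largest, farthest band is
    # painted first and the nearest band is painted last (on top).
    return sorted(bands, key=lambda band: band[0], reverse=True)
```

With Leaflet's GeoJSON layers this just determines the order in which you add the layers to the map.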
it's honestly nothing too crazy. I don't have any DB or any API calls or anything.
- The gridded data inputs are all "normalized" to GeoTIFF files (you can use gdal to convert netCDF files easily)
- The Java standard library can handle simple GeoTIFF images with BufferedImage
- I do some math on the BufferedImage data and then plot the results using the thing/geom heatmap plot
- Just heatmaps on their own are kinda confusing. You need another layer so you can visualize "where you are". Here I plot coastlines. (you could also do elevation contours)
- There I have contours/coastlines as GeoJSON polygons. With `factual/geo` you can read them in and crop them to your observed region using `geo.jts/intersection`. You can then convert these to a series of SVG paths (using `thing/geom` hiccup notation) and overlay them on the heatmap
> and parallelising the code to finish in a reasonable time span (few days)
What's the advantage over just manually making an uberjar and using jlink/jpackage?
Do you have the ability to cross-compile to other architectures/OSes?
Do you have the ability to generate a plain executable? The jlink/jpackage route ends up generating an "installer" for each system, which I find hard/annoying to test, and people are reluctant to install a program you send them.
In the past I've ended up distributing an uberjar because I didn't have the setup to test all the resulting bundles (esp. macOS, which requires buying a separate machine). I also found JavaFX to be a bit inconsistent... though it's been a few years and maybe the situation has improved.
The main pain point jbundle solves is that jpackage generates installers (.deb, .rpm, .dmg, .msi), not plain executables. jbundle produces a single self-contained binary — just a shell stub concatenated with a compressed payload. You chmod +x it, distribute it, and the user runs ./app. No installation step, no system-level changes.
It also automates the full pipeline (detect build system → build uberjar → download JDK → jdeps → jlink → pack) so you don't need a JDK installed on the build machine — it fetches the exact version from Adoptium. Plus it includes startup optimizations like AppCDS (auto-created on first run, JDK 19+), CRaC checkpoints, and profile-tuned JVM flags for CLI vs server workloads.
Cross-compilation:
Yes — jbundle build --target linux-x64 (or linux-aarch64, macos-x64, macos-aarch64). Since the JAR is platform-independent, it just downloads the appropriate JDK runtime for the target OS/arch from Adoptium and bundles it. You can build a Linux binary from macOS and vice-versa.
Plain executable (not an installer):
That's exactly what jbundle produces. The output is a single file you can scp to a server or hand to someone. On first run it extracts the runtime and jar to ~/.jbundle/cache/ (keyed by content hash), so subsequent runs are instant. No .deb, no .dmg, no "install this first" — just a binary.
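A toy sketch of the two mechanisms described here (a stub concatenated with a payload, and content-hash cache keys so repeated runs reuse the extracted files); this is my own illustration, not jbundle's actual code:

```python
import hashlib

# Hypothetical stub: a real one would locate the payload offset in its
# own file and decompress/execute it at run time.
STUB = b"#!/bin/sh\n# ...extract payload and exec the bundled JVM...\nexit 0\n"

def bundle(payload: bytes) -> bytes:
    # Concatenate the shell stub with the (compressed) payload to get a
    # single self-contained file, per the description above.
    return STUB + payload

def cache_key(payload: bytes) -> str:
    # Key the extraction directory by content hash, so the same binary
    # extracts once and subsequent runs hit the cache.
    return hashlib.sha256(payload).hexdigest()[:16]
```

The resulting file starts with a valid shell script, so `chmod +x` and `./app` work; the binary payload after `exit 0` is never interpreted by the shell.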
For the macOS testing concern: since it's a CLI binary (not a .app bundle), it doesn't require signing/notarization to run. And with --target macos-aarch64 you can build it from a Linux CI without needing a Mac.