I think the way courses are taught can give you some needed grounding, like you should always take a good linear regression class. But I think that is as far as it gets you, a theoretical base.
Honestly, the issue is that most ML programs are taught as an additive skill set: the more courses you take the better, or picking the right kind of courses gets you somewhere.
In reality:
1. Most real-world problems are also about subtraction: knowing what not to try and why it might not work. For example, when I ask people about recommendation engines for recommending colocated things, they pile on embeddings. In reality it's about finding good false negatives for the training dataset and calibrating the classifier output, and those are really hard problems. Embeddings may be necessary but are the least of your worries.
2. Most companies will not teach you about the fundamentals of stats; you will be lucky if you can get a mentor in a company that has both the theoretical rigour and the practical implementation skill to solve problems.
3. Most ML problems require engineering to work as well. For example, you can't use Bayesian MCMC to do most things at scale. It's why topic models that relied on statistical methods like simulating the posterior were crazy expensive on large datasets.
4. Models are taught like an end, but courses don't teach you to mix them for debugging. They are usually a means to an end. For example, say you are using decision trees and your models are acting up: you could still try some debugging techniques from linear regression, like residual analysis or plotting the slope of each variable vs. Y, before jumping to Shapley values.
The reason is not that using Shapley values is bad, they are great, but you can get a lot of insight by having some base models that are simpler to debug.
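To make the residual-analysis idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the one-split "stump" stands in for a real decision tree, the data is a made-up noiseless line, and `fit_stump` and `binned_residual_means` are hypothetical helper names, not library functions. The point is only that binning residuals against a feature can reveal structure the model missed.

```python
from statistics import mean

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one split at the median of x."""
    threshold = sorted(xs)[len(xs) // 2]
    left_mean = mean(y for x, y in zip(xs, ys) if x < threshold)
    right_mean = mean(y for x, y in zip(xs, ys) if x >= threshold)
    return lambda x: left_mean if x < threshold else right_mean

def binned_residual_means(xs, residuals, n_bins=4):
    """Average residual in equal-width bins of one feature."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x, r in zip(xs, residuals):
        idx = min(int((x - lo) / width), n_bins - 1)  # clamp the max value
        bins[idx].append(r)
    return [mean(b) if b else 0.0 for b in bins]

# Toy data: y depends linearly on x, which a piecewise-constant model
# can only approximate with flat steps.
xs = [i / 100 for i in range(100)]
ys = [2.0 * x for x in xs]

model = fit_stump(xs, ys)
residuals = [y - model(x) for x, y in zip(xs, ys)]
profile = binned_residual_means(xs, residuals, n_bins=4)
print(profile)
```

If the model had captured the relationship, the binned means would hover around zero. Here they alternate in sign inside each leaf, which is the kind of leftover linear trend that a residual plot shows long before you reach for heavier attribution tools.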
I think this is because of a misalignment that is even common in plenty of other subjects as well. You know how once you've gained expertise in something that it is difficult to explain because it is so obvious? Kinda what is happening in education. Let me explain.
The reason a lot of the theoretical basis is taught is because you need to get the skills to learn why things work, when to use them, when they fail, when not to use them, and __most importantly__ their limitations. The problem is, most of this isn't explained explicitly. Maybe just this process happening for a few decades and momentum. Or that teaching isn't a priority and so no one tries to fix it. (there are exceptions to this. You've all probably met professors that are outstanding and make boring things seem fascinating)
But what you're talking about is part of this "when to use, what to use" part. It is also why those classes are so boring: they aren't properly motivated. But it is also why we're running into so many problems: evaluation is fucking hard. You see models perform really well in research papers but not in the real world, and you'll also see researchers evaluating papers purely on single benchmarks. "In reality" you're forced to come to terms with the limitations of datasets, as datasets are just proxies and what you care about is the actual generalization. But if we're not discussing and evaluating on actual generalization in research then we get this dichotomy.
There are definitely more efficient (tractable) posterior estimators that work at large scale, but a lot of this stuff isn't really known unless you're in that niche yourself. Statistics is often taught from the perspective of "here's a bunch of tools and when to use them" rather than "here are the problems, our assumptions, and the main tool we use to solve them. It looks different in different settings, but they are actually the same thing." So it is kinda problematic, but then again, getting there requires a lot more work and most people aren't going to bother with things like metric theory. So a middle ground approach is taken and it gets jumbled.
The people with experience also aren't necessarily the ones that end up teaching - which isn't to say the same information can't be conveyed necessarily (e.g. good academics keep up to date with the field in industry) but there is powerful focus that practical experimentation brings.
I know. I love to teach too. But I’d never take it up as a profession. The vast majority of successful people who have a yearning to pass on their knowledge hand select a few protégés or write a book.
I have had one teacher who was like that. He’d been involved in the development of nuclear weapons before he retired. Incredibly smart guy. Unfortunately, he couldn’t teach physics worth a dime. He had the highest drop out rate of any physics teacher at my college.
Those two scenarios cover the vast majority of cases.
> most real world problems are also about subtraction knowing what not to try and why it might not work
This is true in most fields. I view school as giving you a broad overview of everything that you might need in your field, but for any given problem it will be on you to narrow it down to the solutions you actually need and then to learn that specific set of solutions well enough to apply it.
People fresh out of college will usually try to apply everything all at once until they learn—either from a mentor or their own hard experience—to filter it down. It might be that ML has it worse than other fields right now not because it's taught wrong but because it's new enough that there aren't enough mentors with decades of war stories.
I don't know about ML, but if you want to learn applied stats I would look up Andrew Gelman's books, or one of the newer books on Bayesian inference that use Stan, and work through them cover to cover.
There are a lot of courses out there on AI from esteemed institutions at that. What do people recommend as a curriculum for someone with a formal univ education in CS albeit from a while ago and who has programmed extensively though not in Python.
The goal at the end is to have a deep understanding of the LLM space and its adjacencies.
Seconded. It's a hands-on approach starting with implementing a pytorch-like api from the ground up with manual backprop up to implementing a simple transformer / gpt variant in actual pytorch.
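For anyone wondering what "a pytorch-like api with manual backprop" means in practice, here is a minimal sketch of the idea, assuming only scalar values and two operations. It is not the code from the series; the `Value` class and its attribute names are my own illustrative choices.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward_fn()

a, b, c = Value(2.0), Value(3.0), Value(4.0)
loss = a * b + c
loss.backward()
print(a.grad, b.grad, c.grad)
```

Gradients are accumulated with `+=` so that a value used more than once (for example `a * a`) receives the sum of its contributions.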
Would you mind sharing, what level of programming & mathematical background do I need? I know basic python (read python for data analysis) & currently half way through elements of statistical learning. What else do I need to learn?
> half way through elements of statistical learning
If you're able to make it that far in ESL, mathematical background certainly won't hold you back when learning anything "neural networks" related. Especially not a spelled-out practical intro.
> I know basic python (read python for data analysis)
You may want to get more comfortable with programming in general (outside of the data analysis realm), but you can learn everything you're missing while watching Karpathy's series (and referencing the python docs).
> The goal at the end is to have a deep understanding of the LLM space and its adjacency.
This is kinda a hard thing to quantify. How are we defining deep? Like you want to understand how they work? The Karpathy videos are good for that. But I wouldn't call this "deep".
If you want to get down into the weeds and into the mud, you need a hell of a lot more than 13hrs of education. You're also going to have a hard time doing this because most people are going from an engineering perspective of "enough to work with it" rather than "I fundamentally want to understand all inner workings". If you are the former, then the fastai course and others are great for you. If you want to really get deep though, you're going to need a lot more than programming. You're going to need some pretty advanced maths too: high dimensional statistics, metric theory, and optimization theory are some. (Most researchers aren't doing this btw) But if you do go down this path you'll also be able to understand the full spectrum of generative models and have a clearer picture. But I should also say that there is still a black box element to these models as they are so large that they are near impossible to analyze. But it is definitely achievable to learn a 2 layer transformer autoregressive network and fully understand its inner workings. But programming skills alone won't get you there.
Thanks for the helpful advice. What would you recommend to someone who is interested in learning about diffusion models? I have a CS degree but I have 0 knowledge about AI. Things like Stable Diffusion have blown my mind and I’m really interested in learning about this field. Lots of courses out there but I lack the expertise to discern which one is good.
Yeah no problem, this is even closer to my area of focus! What do you know about physics and thermodynamics?
I'd say a good intro for low background is from Tomczak[0]. He has a book, but the blog posts are nearly identical. He did a post doc with Max Welling (someone you should learn about if you want to get deep, like I was suggesting before). So I'd switch things up slightly. I'd go Intro -> Autoregressive -> Flow -> VAE -> Hierarchical VAE -> Energy Based Models -> Diffusion. It is worth learning about GANs btw, but this progression should be natural and build up.
Continuing from there, you're going to want to learn about things like Langevin Dynamics, Score Matching, and so on. Start with Yang Song's blogs[1]. Your goal should be to understand this paper[2]. Once you get there, you should be able to understand the famous DDPM paper[3]. But we went through Tomczak not just to get a good understanding of diffusion at a deeper level, but because you need these tools to understand Stable Diffusion, which really is just Latent Diffusion[4]. This should connect back with Tomczak's 2 Improving VAE papers and you should also be able to understand NVAE.
This is probably the quickest way to get you to a good understanding, but if you want to dig deeper, which I highly encourage (because there are major issues that people aren't discussing), then you'll need more time. But you'll probably have the tools to do so if you go through this route. Other people I suggest looking into: Diederik Kingma, Ruiqi Gao, Stefano Ermon, Jonathan Ho, Ricky T. Q. Chen, and Arash Vahdat.
I would start with a fastai course such as practical deep learning for coders.
After doing one of the fastai courses you will have some applied Python project experience, and you can home in on whichever part of the project interests you most intellectually.
How long has it been since you studied/used university-level math? Calculus and linear algebra in particular.
I ask because it’s pretty difficult to get through the math of backprop without a firm grasp of these. The Python part is trivial by comparison, the main difficulty being the matching of dimensions.
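As a small illustration of the dimension matching, here is a sketch of the backward pass for a single linear layer `out = x @ w`, written with plain lists so the shapes are explicit. The helper names (`shape`, `transpose`, `matmul`, `linear_backward`) are mine, and the loss is assumed to be the sum of all outputs so the upstream gradient is all ones.

```python
def shape(m):
    return (len(m), len(m[0]))

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Fail loudly when the inner dimensions do not line up.
    assert shape(a)[1] == shape(b)[0], f"shape mismatch: {shape(a)} @ {shape(b)}"
    b_t = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_t] for row in a]

def linear_backward(x, w, grad_out):
    """Gradients of a loss with respect to x and w, given d(loss)/d(out)."""
    grad_w = matmul(transpose(x), grad_out)  # (d, n) @ (n, k) -> (d, k), same as w
    grad_x = matmul(grad_out, transpose(w))  # (n, k) @ (k, d) -> (n, d), same as x
    return grad_x, grad_w

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # shape (n=2, d=3)
w = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # shape (d=3, k=2)
out = matmul(x, w)                         # shape (n=2, k=2)
grad_out = [[1.0, 1.0], [1.0, 1.0]]        # d(sum(out))/d(out)
grad_x, grad_w = linear_backward(x, w, grad_out)
print(shape(grad_x), shape(grad_w))
```

A useful sanity check is that each gradient must have the same shape as the thing it is a gradient of, which usually leaves only one way to arrange the transposes.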
While many learn calculus in high school, many also don't get it till uni. Not everyone is at your level, or took your same path, and that's okay. Don't shame people for not knowing when they're trying to learn.
I think you are asking specifically about practical LLM engineering and not the underlying science.
Honestly this is all moving so fast you can do well by reading the news, following a few reddits/substacks, and skimming the prompt engineering papers as they come out every week (!).
I strongly recommend playing with the OpenAI APIs and working with langchain in a Colab notebook to get a feel for how these all fit together. Also, the tools here are incredibly simple and easy to understand (very new) so looking at, say, https://github.com/minimaxir/simpleaichat/tree/main/simpleai... or https://github.com/smol-ai/developer and digging in to the prompts, what goes in system vs assistant roles, how you guide the LLM, etc.
This might be taboo, but I used ChatGPT to educate me on some basic concepts, then deeper concepts - and put together a learning-plan and a syllabus for me with also a glossary of terms....
The cool thing is, it helped me put a more structured thought process around how I should pursue learning AI...
I couldn't find anything concise out there - and this helped me to better think through everything:
If anything - it's a good primer for getting your own thought process on the subject going...
Did you click the 'share' link on the left (which generates a public URL for a snapshot of the chat), or did you just copy the URL from the address bar (which will only work for you)?
The shareable URLs contain the word 'share' in the path.
Define “deep understanding” here? You certainly have to learn Python, at least because you are gonna need it for data manipulation and cleaning no matter what you do in this field.
I've moved from "traditional" software engineering to a role working with ML (building + deploying models used in product features). Of the team I work with, and in my extended communication with developers at other companies making the same transition, every single person has said the François Chollet book (Deep Learning with Python) is all they really needed.
It walks a very thin line between too little info and *just* enough to get you to the point where you know what you don't know (the productive point), and it explains the math in code samples. It really is a very good way of teaching. When I was reading, I thought the theory covered was too far from the mathematical base, but I found myself being surprised at how I could hold my own in discussions that moved into theory.
That said, this likely won't be enough for you to be a researcher - but I imagine for a lot of people tempted by courses like the OP - that isn't the actual end goal anyway.
Thanks for the book recommendation! I’m interested in making the same transition. Can I ask what you did to be considered for a role in ML coming from a software engineering background? Did you showcase any personal projects in your resume?
So this might not work for you, but I will tell you my path anyway.
1. Was one of the first members of the #ai slack channel inviting some people I had in person conversations about AI with.
2. I posted _a lot_ in there. Stuff about regulatory updates, people using co-pilot, cool github repos, little demo projects I was working on.
3. Now this was pure luck and probably the best thing to push me over the boundary, there was a hackathon. I thought "Hmm if I make a kick ass demo showcasing generative AI here, a lot of high up people will see it" - that 100% happened, CTO reached out to me saying demo was great and that people will be in touch.
4. I started really digging into how I could provide value to our existing data team - be that code, deploying things, or bringing some of my engineering know-how to that team. At this point the #ai channel really started to grow, and the head of data and engineering started talking to me and directing people my way based on what they saw at the hackathon.
5. Did a demo of my hack in the company all hands which the CEO was MC'ing.
6. Started having fortnightly 1 to 1s with head of data at this point
7. Floated idea of team taking a little subset of good and motivated people from other teams for a short time to investigate and implement LLMs in some small way into our apps. That team has now grown to effectively investigate any and all use cases (internal and external) for generative AI.
8. I started reading more theory and also following a bit of a road map for things I should learn to have a better picture of how to actually bring LLMs in some form to production (fine-tuning, vector dbs, functions, guard rails).
9. Now I am just building a quick feature in the mobile app to showcase the value of the team to the execs as quickly as I can, which should give us a few months' cover to work on the thing I am really interested in - a multi-armed bandit LLM that uses our existing models.
This was pretty much it. Seems trivial, but in between each point was lots of reading, tinkering, and working on weekends, but it's totally possible. The ML + AI focused PhDs in your company likely need help from engineering but don't know it - bringing those two groups together quickly shows how you can be useful.
It's one of the best courses to take if you want to obtain some fundamental understanding of the mathematical concepts behind AI.
Yes there's much more to it than NNs/transformers/'Attention is all you need' paper/whatever else is trendy right now. No, don't expect to do important, as in employable, work if you won't be spending some time truly understanding the mathematical foundations.
I hate academic trend following as much as anyone, but is it really true people are not employable in this space if they don't understand mathematical foundations? Sure, they're not getting a job at DeepMind, but it feels like there are many successful ML grifters these days too. Maybe I'm just on Twitter too much.
> Yes there's much more to it than NNs/transformers/'Attention is all you need' paper/whatever else is trendy right now. No, don't expect to do important, as in employable, work if you won't be spending some time truly understanding the mathematical foundations.
There are also Coursera courses that cover much of the same content (taught by Andrew Ng as well in many cases). They have specializations for Machine Learning [1], Deep Learning, etc. These are paid via Coursera subscription, but financial assistance is available.
I recently completed the specialization with Andrew Ng and think it’s a fantastic introduction to ML. It has a good blend of theory, practical tips, and coding.
If anyone is interested, I’ve published detailed notes and my submissions for the lab assignments:
Honestly, Andrej Karpathy's video series on YouTube is good enough to keep me from even looking at for-profit courses. That attitude might change as I get further along in them, but for now I'm a big fan of his pedagogical approach.
You might be better off searching MIT OCW and selecting video lectures, looking at Stanford's YouTube channel, or checking out the 2019 videos: https://ai.stanford.edu/stanford-ai-courses/
Most of these look like they're just the slides and syllabus, correct me if I'm wrong.
I am in no way affiliated to Stanford. I don't think you can take the courses for free, but you can sure as hell read through the slides for many of the courses. Cheers!
I would not start with any course unless you had a project in mind that would take the knowledge from the course to get started. Otherwise you risk wasting a lot of time for knowledge that won't help you in any way and will get outdated in a few months anyway
I took a deep learning course in late 2019, during which we implemented transformers as described in Attention is All You Need and fine-tuned GPT-2. The output was amusing, but useless, but I still remember the basic principles.
Now, a few years later, transformers are the tech, and GPT-2's successors are the most hyped technologies of the century so far.
All of which is to say that I wouldn't assume that coursework without immediate application is useless. I'm in a much better position to jump in on the latest AI stuff than I would be if I hadn't taken that course.
I mean if someone likes the subject, then they should invent a project that forces them to build something or write something up. Otherwise it's easy to "fake" learning by going through the motions on a bunch of tutorials and quizzes.
It really depends how good the course is, with well-designed problem sets/project prompts you can't really fake learning (assuming you actually complete them). Is it going to be totally exhaustive of everything you may need to know in practice? Obviously not, but no single project will be either, especially not for such a broad field as machine learning.
Independent projects can definitely be a great way to learn, and yes many courses are shitty. But it is also very possible to take a good course and walk away with new knowledge you didn't even realize you needed. Some of my favorite projects actually started with an idea from a course, and then I learned even more in order to further expand on it. Synergy between project-driven and course-driven education can be a powerful iterative process.
Is there a project-based tutorial that talks more about neural net architecture, hyperparameter selection and debugging? Something that walks through getting poor results and makes the reasoning for each tweak explicit?
When I try to use transformers or any AI thing on a toy problem I come up with, it never works. Even Fizz-Buzz which I thought was easy doesn't work (because division or modulo is apparently hard to represent for NNs). And there's this blackbox of training that's hard to debug into. Yes, for the available resources, if you pick the exact same problem, the exact same NN architecture and exact same hyperparameters, it all works out. But surely they didn't get that on the first try. So what's the tweaking process?
Somehow this point isn't often talked about in courses and consequently the ones who've passed this hurdle don't get their experience transferred. I'd follow an entire course on this if it were available. An HN commenter linked me to this
which is exactly on point. But it'd be great if it were one or more tutorials with a specific example, wrapped in code and peppered with many failures.
There's no great answer to this question. It is a bunch of tricks. Fundamentally:
If you're saying FizzBuzz doesn't work, presumably you mean that encoding the n directly doesn't work. Neither does encoding n from 0 to 1 or between -1 and 1 (and don't forget: obviously don't use relu with -1 to 1). It doesn't.
Neural networks can do a LOT of things, but they cannot deal with numbers. And they certainly cannot deal with natural or real numbers. BUT they can deal with certain encodings.
Instead of using the number directly, give one input to the neural network per bit of the number. That will work. Just pass in the last 10 bits of the number.
Or cheat and use transformers. Pass in the last 5 generations and have it construct the next FizzBuzz line. That will work. Because it's possible.
To make the number-based neural network for FizzBuzz "perfect" think about it. The neural network needs to be able to divide by 3 and 5. They can't. You can't fix that. You must make it possible for the neural network to learn the algorithm for dividing by 3 and 5 ... 2, 3 and 5 are relative primes (and actual primes). So "cheat" and pass in numbers in base 15 (by one-hot encoding the number mod 15 for example).
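Here is a minimal sketch of the two input encodings mentioned above, in plain Python with no network attached. The function names (`to_bits`, `one_hot_mod`, `fizzbuzz_label`) are hypothetical, and the 10-bit width and base 15 are just the values from the comment. The last part checks the claim that with the mod-15 one-hot input the label is a pure lookup, which is why that version counts as "cheating".

```python
def to_bits(n, width=10):
    """Binary encoding: one input per bit, least significant bit first."""
    return [(n >> i) & 1 for i in range(width)]

def one_hot_mod(n, base=15):
    """One-hot encoding of n mod base."""
    return [1 if i == n % base else 0 for i in range(base)]

def fizzbuzz_label(n):
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return "Number"

# With the mod-15 one-hot input, every encoding maps to exactly one label,
# so a single linear layer (or a lookup table) is enough.
lookup = {}
consistent = True
for n in range(1, 1001):
    key = tuple(one_hot_mod(n))
    label = fizzbuzz_label(n)
    if lookup.setdefault(key, label) != label:
        consistent = False

print(to_bits(5, width=4), len(lookup), consistent)
```

The bit encoding gives the network bounded 0/1 inputs but still leaves it to learn divisibility by 3 and 5 from the bit patterns, whereas the mod-15 encoding hands over the answer's structure directly.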
PM me if you'd like to debug whatever network you have together over zoom or Google meets or whatever.
This may be catastrophically wrong. I only have a master's in machine learning (a European master's degree, meaning I've written several theses on it; I didn't pass the first time and had to work full time to be able to study), and I was writing captcha crackers using ConvNets in 2002. But I've never been able to convince anyone to hire me to do anything machine learning related.
Thanks for answering, what you wrote here is exactly the sort of thing I'm talking about. Something implicit that's known but not obvious if you look at the first few lectures of the first few courses (or blogs or announcements, etc).
You mention a bag of tricks and that's indeed one issue, but it's worse than that because it includes knowing which "silent problems" need a trick applied to them in the first place!
Indeed, despite using vectors everywhere, NNs are bad with numerical input encoded as itself! It's almost like the only kind of variables you can have are fixed-size enums, which you then encode into vectors that are as far apart as possible, and unit vectors ("one-hot vectors") do this. But that's not quite it, and sometimes you can still have some meaningful metric on the input that's preserved in the encoding (example: word embeddings). And so it's again unclear what you can give it and what you can't.
In this toy example, I have an idea of what the shape of the solution is. But generally I do not and would not know to use a base 15 encoding or to send it the last 5 (or 15) outputs as inputs. I know you already sort of addressed this point in your last few paragraphs.
I'm still trying out toy problems at the time so it might be a "waste" of your time to troubleshoot these but I'm happy to take you up on the offer. HN doesn't have PMs though.
Do you remember when you first learned about the things you are using in your reply here? Was it in a course or just asking someone else who had worked on NNs for longer? I learned by googling and finding comment threads like these! But they are not easy to collect or find together.
> This may be catastrophically wrong. I only have a master's in machine learning (a European master's degree, meaning I've written several theses on it; I didn't pass the first time and had to work full time to be able to study), and I was writing captcha crackers using ConvNets in 2002. But I've never been able to convince anyone to hire me to do anything machine learning related.
Oh wow, those are great credentials. I'm surprised that you haven't run across a position yet. Maybe it is a matter of your location? It seems like a lot of these jobs want onsite workers, which can be a real problem.
TBH, I get the feeling that a lot of us without such credentials are in a similar position right now. Slowly trying to work our way towards what seems to be a big new green field, but having a really unclear path to getting there...
Yes. I created a course which uses implementing Stable Diffusion from scratch as the project, and goes through lots of architecture choices, hyperparam selection, and debugging. (But note that this isn't something that's fast or easy to learn - it'll take around a month full-time intensive study.)
https://course.fast.ai/Lessons/part2.html
Thanks for making that course. It was on my list of courses to look at since GPT-4 recommended it (with all the caveat that entails :) ). Thanks for also making notebooks available alongside the videos.
However, can you point me to the lectures where training happens (along with the architecture choices, hyperparam selection, and debugging)? I'm less familiar with SD, but at a quick glance it seems like we're using a pretrained model and implementing bits that will eventually be useful for training, but not training a new model, at least in the beginning of the deep dive notebook and first few lessons (starting at part 2, lesson 9).
If I search the intro to robotics course online, I see there is a playlist from Stanford but the videos are 14 years old. Does anyone know if there are more recent videos? Or are there other courses that are good for robotics?
Entropy of a single signal, say a sequence of letters like "ababababab", is the scaled "average" surprise per letter. So if they are uniformly distributed, each letter is equally likely/unlikely to come next in the sequence, whereas if instead one letter occurs only 1/1000th of the time (aaa....aaa...aa..a.z.aaaa), then when the rare beast shows up, it is a big surprise, so the total amount of surprise available in the sequence is high.
That's entropy.
The same thing would be true for a sequence of numbers.
But what if there is some relationship? if
aaabaa occurs frequently with 111211, if you line up the sequences by timestamp?
In this simple case, if you know the letters and you can spot the relationship, then there is zero surprise in the number sequence. The cross entropy "letters plus numbers" has the same entropy as "letters" or "numbers" in isolation.
And as you move away from the 1:1 correspondence, you'll see the cross entropy increase until it reaches its max at "entropy(letters) + entropy(numbers)" -- no information shared between the two systems.
To bring it home, I think of cross entropy as the amount of information shared between two signals.
> if instead one letter occurs only 1/1000th of the time (aaa....aaa...aa..a.z.aaaa), then when the rare beast shows up, it is a big surprise, so the total amount of surprise available in the sequence is high
…when a Bernoulli distribution is skewed, the maximum surprise is high, yes, but the average surprise (= entropy) is low. The entropy of a Bernoulli distribution is maximized when p = 0.5 and falls off to either end:
For your examples, if the sequence is uniformly distributed (Bernoulli(1/2)), the entropy is log(2) ≈ 0.693 nats (exactly 1 bit) per symbol; if instead one letter occurs 1/1000th of the time, the entropy is about 0.0079 nats (about 0.0114 bits) per symbol.
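If you want to check these numbers yourself, here is a short sketch using only the standard library. The function name `bernoulli_entropy` is mine; it evaluates H(p) = -p log(p) - (1 - p) log(1 - p), and the `base` argument switches between nats (natural log) and bits (log base 2), which is worth keeping an eye on since log(2) ≈ 0.693 is a value in nats.

```python
import math

def bernoulli_entropy(p, base=math.e):
    """Entropy of a two-symbol source with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no surprise
    nats = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return nats / math.log(base)

uniform_nats = bernoulli_entropy(0.5)
uniform_bits = bernoulli_entropy(0.5, base=2)
skewed_nats = bernoulli_entropy(0.001)
skewed_bits = bernoulli_entropy(0.001, base=2)

# Surprise of the single rare symbol, in bits: large, but it almost never occurs.
rare_symbol_surprise_bits = -math.log2(0.001)

print(uniform_nats, uniform_bits)
print(skewed_nats, skewed_bits)
print(rare_symbol_surprise_bits)
```

The rare symbol on its own is very surprising, but weighting that surprise by how seldom it appears is what pulls the average (the entropy) down toward zero.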