
The crazy thing is that all these models are just one local minimum, out of a staggering (unknown?!) number of such points on the loss surface.
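A toy sketch of what that means (numpy assumed, nothing like real training scale): plain gradient descent on a small non-convex loss lands in a different minimum depending on where it starts.

    import numpy as np

    def loss(w):            # toy non-convex "loss surface"
        return np.sin(3 * w) + 0.1 * w ** 2

    def grad(w):            # its analytic derivative
        return 3 * np.cos(3 * w) + 0.2 * w

    for seed in range(5):
        w = np.random.default_rng(seed).uniform(-5, 5)   # random init
        for _ in range(500):                             # plain gradient descent
            w -= 0.01 * grad(w)
        print(f"seed {seed}: w = {w:+.3f}, loss = {loss(w):+.3f}")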


“Brute forcing a really inefficient approximation/estimator” is a good way to summarize it.

It’s like fitting an overfit equation to a sample of data points instead of the simpler line they actually fall near.

They end up being black boxes: we have almost no idea how they work inside, and no idea how overtrained they are when something simpler could do the same job.
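Rough illustration of the overfit-vs-simple-line analogy (numpy assumed): a high-degree polynomial chases the noise that a plain line already explains, and falls apart just outside the sample.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 12)
    y = 2 * x + 1 + rng.normal(0, 0.1, x.size)   # points near a simple line

    line = np.polyfit(x, y, deg=1)     # 2 parameters
    wiggly = np.polyfit(x, y, deg=9)   # 10 parameters, chases the noise

    x_new = 1.2                        # just outside the sample
    print("line model predicts     ", np.polyval(line, x_new))
    print("degree-9 model predicts ", np.polyval(wiggly, x_new))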


I don't think "brute forcing" is an adequate term for gradient descent. Brute forcing would be trying random weights with no system at all, imo.
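Quick sketch of the difference (numpy assumed, toy quadratic loss): random guessing vs. following the slope.

    import numpy as np

    def loss(w):
        return np.sum((w - 3.0) ** 2)   # minimum at w = (3, 3)

    rng = np.random.default_rng(0)

    # "brute force": sample random weights, keep the best
    best = min(loss(rng.uniform(-10, 10, size=2)) for _ in range(1000))

    # gradient descent: 1000 informed steps from one random start
    w = rng.uniform(-10, 10, size=2)
    for _ in range(1000):
        w -= 0.1 * 2 * (w - 3.0)        # analytic gradient of the loss

    print("random search, best of 1000: ", best)
    print("gradient descent, 1000 steps:", loss(w))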


Can "something simpler", for example, code correct function bodies from comments describing functions in natural language? I think people are too quick to dismiss the power of these models.


I am by no means dismissing the power. They are created very chaotically, however. Spaghetti thrown at a wall. They are brute force approximations.

They are wasteful. If LLaMA 13B is as powerful as previous 65B models, that's a significant number of unnecessary parameters shed/pruned in just this one iterative upgrade. How small can they go? The fewest parameters that do the job 99% as well is the way to go.
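The simplest version of that idea is plain magnitude pruning; a sketch (PyTorch assumed, single layer only, real compression is far more involved):

    import torch
    import torch.nn.utils.prune as prune

    layer = torch.nn.Linear(1024, 1024)
    prune.l1_unstructured(layer, name="weight", amount=0.5)   # zero the smallest 50%
    prune.remove(layer, "weight")                             # make the pruning permanent

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"zeroed weights: {sparsity:.0%}")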

There is also a difference between compressing the rules and use of language directly into the model vs. compressing all the information known to humans into it. A smaller model that ingests relevant information on the fly (more like Bing, which supplements itself with search) may be less wasteful and perform better.
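A toy version of retrieve-then-prompt (pure Python; naive word-overlap scoring stands in for real embeddings, and the docs and query are made up):

    docs = [
        "LLaMA 13B was reported to match much larger earlier models on benchmarks.",
        "Bing supplements the model with live web search results.",
        "Gradient descent follows the local slope of the loss surface.",
    ]

    def retrieve(query, k=1):
        # rank docs by naive word overlap with the query
        q = set(query.lower().split())
        return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

    query = "How does Bing keep answers current?"
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    print(prompt)   # this prompt would then go to the (smaller) model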

The current models being released are chosen because "they work", not because they are the leanest, most optimized way to get that performance.



