Hacker News
Why do random forests work? They are self-regularizing adaptive smoothers (arxiv.org)
295 points by sebg 16 days ago | 41 comments



Nice article, although I find it a bit overly complex if you are not familiar with both ML and mathematics.

I will leave here a geometrical intuition of why they work in case it can help someone:

To simplify things, I will talk about regression: say we want to predict some value y that we know depends on x, and we have some noisy measurements of y.

y = f(x)

y_observed = f(x) + noise

If you want some mental image, think about f(x)=sin(x).

Now, an (overfitted) regression tree in this case is just a step function whose value at x is y_observed. If there is no noise, we know that by doing more measurements we can approximate y with as much precision as we want. But if there is noise, the regression tree will overfit the noisy values, creating some artificial spikes.

If you want to avoid this overfitting, you sample the values of X many times, build a regression tree for each sample, and then average the trees. Every tree will contain its own particular artificial spikes, and if they are noise, they won't appear in the majority of the other trees. So when you average the trees, the spikes attenuate, creating the smoother behaviour that the article talks about.
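
A minimal sketch of that intuition, assuming sklearn and numpy (the sample sizes and noise level are arbitrary):

    # Averaging many overfit regression trees fit to noisy sin(x) measurements.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 2 * np.pi, 200)).reshape(-1, 1)
    y_obs = np.sin(x).ravel() + rng.normal(0, 0.3, 200)      # y_observed = f(x) + noise
    x_test = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)
    truth = np.sin(x_test).ravel()

    # One fully grown tree: a step function through the noisy points, with spikes.
    single = DecisionTreeRegressor().fit(x, y_obs).predict(x_test)

    # Trees fit on bootstrap resamples: each has its own spikes, which rarely
    # coincide, so they attenuate in the average.
    preds = []
    for _ in range(100):
        idx = rng.integers(0, len(x), len(x))
        preds.append(DecisionTreeRegressor().fit(x[idx], y_obs[idx]).predict(x_test))
    averaged = np.mean(preds, axis=0)

    print("single tree MSE vs sin(x):   ", np.mean((single - truth) ** 2))
    print("averaged trees MSE vs sin(x):", np.mean((averaged - truth) ** 2))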

I hope it helps!


This is good intuition for why ensembling overparametrized models is a good idea. It doesn't speak to why ensembles of tree-structured estimators in particular perform so well compared to ensembles of other nonparametric estimators.


If you look at what makes it work well in the example, I would say it is the ability to easily approximate a function to whatever degree of precision you want, which translates into being able to isolate spikes in the approximation.

For example, one could ask: what if, instead of an approximation by step functions, we use a piecewise linear approximation (which is just as good)? You can do that with a fully connected neural network with ReLU nonlinearities, and if you check it experimentally, you will see that the results are equivalent.
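
If you want to try that comparison, here is a rough sketch assuming sklearn's MLPRegressor as the piecewise-linear learner (the network size and number of ensemble members are arbitrary):

    # Bagging small ReLU networks instead of trees on the same noisy sin(x) data.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 2 * np.pi, 200)).reshape(-1, 1)
    y_obs = np.sin(x).ravel() + rng.normal(0, 0.3, 200)
    x_test = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)

    preds = []
    for seed in range(20):
        idx = rng.integers(0, len(x), len(x))                 # bootstrap sample
        net = MLPRegressor(hidden_layer_sizes=(50,), activation="relu",
                           max_iter=5000, random_state=seed)
        net.fit(x[idx], y_obs[idx])
        preds.append(net.predict(x_test))
    averaged = np.mean(preds, axis=0)
    print("ensembled ReLU nets MSE vs sin(x):",
          np.mean((averaged - np.sin(x_test).ravel()) ** 2))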

Why do people often use ensembles of trees? The ensembling is built into the common packages, which is not the case for ANNs, so it is quicker to experiment with. Apart from that, if you have features that behave like categorical variables, trees also behave better during training.


Thanks, that helps. The way I think about your example is that it's like (not the same, obv) taking a bunch of moving averages of different durations at different starting points and throwing those into your regression model along with the actual data.


So it seems that when you have different sources of error, averaging them cancels the noise. I think some property of the error sources is necessary, though: in some sense the sources should be independent.


It would also be interesting to see the limitations of random forests and where they struggle to learn and produce good results.


I get pretty abysmal results for scarce categories even though they have well defined preconditions.


So you overlay lots of really squiggly lines, and the average (the points most of them have in common) is the actual function you're looking for?


Good to see more research exploring the connection between trees, ensembles, and smoothing. Way back in Trevor Hastie's ESL book there's a section on how gradient boosting using "stumps" (trees with only one split) is equivalent to an additive spline model (GAM, technically) with a step function as a spline basis and adaptive knot placement.
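
A quick way to see that stump point, sketched with sklearn's GradientBoostingRegressor (the additive target below is made up purely for illustration): each depth-1 tree splits on a single feature, so the boosted sum is additive in the features.

    # Boosting depth-1 trees ("stumps") on an additive target.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(2000, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 2000)

    gbm = GradientBoostingRegressor(max_depth=1, n_estimators=500,
                                    learning_rate=0.1).fit(X, y)

    # Every stump uses exactly one feature, so the model is a sum of
    # one-dimensional step functions (a GAM with adaptive knot placement).
    split_features = [est[0].tree_.feature[0] for est in gbm.estimators_]
    print("stumps splitting on x0:", split_features.count(0),
          " on x1:", split_features.count(1))
    print("training R^2:", gbm.score(X, y))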

I've always thought there should be a deep connection between ReLU neural nets and regularized adaptive smoothers as well, since the ReLU function is itself a spline basis (a so-called truncated linear spline) and happens to span the same functional basis as B-splines of the same degree.


I don't know if you have come across this: A Spline Theory of Deep Networks [1]. Has been on my To-Read list forever :(

[1] http://proceedings.mlr.press/v80/balestriero18b/balestriero1...


One of my biggest pet peeves is flagrant overuse of "deep". Everything is so deep around here these days...

> since the ReLU function is itself a spline basis (a so-called truncated linear spline) and happens to span the same functional basis as B-splines of the same degree.

... you literally just spelled out the entire "depth" of it.


The same team wrote another interesting paper arguing that there's no "double descent" in linear regression, trees, and boosting, despite what many argued before (in this paper they don't tackle deep learning double descent, but remark that the case may be similar regarding the existence of different complexity measures being conflated).



Random forests are incredibly cool algorithms.

The idea that you can take hundreds of bad models that overfit (the individual decision trees), add even more randomness by randomly picking training data and features*, and average them together - it's frankly amazing that this leads to consistently OK models. Often not the best but rarely the worst. There's a reason they're such a common baseline to compare against.

*Unless you're using Sklearn, whose implementation of RandomForestRegressor is not a random forest. It's actually bagged trees because they don't randomly select features. Why they kept the misleading class name is beyond me.
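
If I read the sklearn docs right, this is controlled by max_features, which for the regressor defaults to using all features; a sketch of the difference (synthetic data, arbitrary settings):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

    # Default: every split considers all features, i.e. bagged trees.
    bagged = RandomForestRegressor(n_estimators=200, random_state=0)
    # Breiman-style: only a random subset of features is considered at each split.
    breiman = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)

    print("bagged trees  CV R^2:", cross_val_score(bagged, X, y).mean())
    print("random forest CV R^2:", cross_val_score(breiman, X, y).mean())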


With a relatively small variation to make them gradient-boosted trees, it's pretty much as good as it gets for tabular data these days.


TIL sklearn doesn't actually implement a random forest. Good to know, thank you!


It does implement one, just the default settings do not


I like this article. Randomness in system design is one of the most practical ways to handle the messiness of real-world inputs, and I think random forests nail this by using randomness to produce useful outputs for varied inputs without overfitting, and to adapt to the unexpected. Yeah, you can always engineer a system that explicitly handles every possible situation, but the important question is "how long/costly will that process be?". Deterministic systems aren't good on that front, and when edge cases hit, those rigid models sometimes crack. Controlled randomness (load balancing, feature selection, etc.) makes systems more flexible and resilient. You don't get stuck in the same predictable ruts, and that's exactly why randomness works where pure determinism fails.


My understanding of why bagging works well is that it's a variance reduction technique.

If you have a particular algorithm, the bias will not increase if you train n versions in an ensemble, but the variance will decrease: anomalous observations won't persistently show up in the submodels' random samples, and so won't persist through the bagging process.

You can test this: the gap between train and test AUC will not increase dramatically as you increase the number of trees in an sklearn random forest on the same data and hyperparameters.
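
Something along these lines should show it (a sketch with a synthetic dataset; warm_start just avoids refitting from scratch at each step):

    # Train/test AUC gap as the number of trees grows.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    rf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
    for n in [10, 50, 100, 300, 500]:
        rf.set_params(n_estimators=n)
        rf.fit(X_tr, y_tr)
        auc_tr = roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1])
        auc_te = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
        print(f"{n:4d} trees  train AUC {auc_tr:.3f}  test AUC {auc_te:.3f}  gap {auc_tr - auc_te:.3f}")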


Any suggestions for a 101 introduction to random forests? In university I encountered some ML courses but never random forests.


Statquest is really good. Maybe a little basic if you’ve taken ML courses, though?

https://youtube.com/playlist?list=PLblh5JKOoLUIE96dI3U7oxHaC...


My book, Effective XGBoost, covers tree theory from stumps to decision trees, to bagging (random forests) to boosting.


Grab any ML book and read the chapter on random forests. If you have the maths background (the bar is not particularly high for random forests), which you should if you took ML courses, it's all going to be pretty straightforward. I think someone already mentioned Hastie's The Elements of Statistical Learning in this thread; you can download it for free and it would be a good start.


Yeah I did a deep dive on them here -- doesn't require any particular math background beyond high school: https://www.youtube.com/watch?v=blyXCk4sgEg


Here's some context and a partial summary (youoy also has a nice summary) --

Context:

A random forest is an ML model that can be trained to predict an output value based on a list of input features: eg, predicting a house's value based on square footage, location, etc. This paper focuses on regression models, meaning the output value is a real number (or a vector thereof). Classical ML theory suggests that models with many learned parameters are more likely to overfit the training data, meaning that when you predict an output for a test (non-training) input, the predicted value is less likely to be correct because the model is not generalizing well (it does well on training data, but not on test data - aka, it has memorized, but not understood).

Historically, the surprise is that random forests can have many parameters yet don't overfit. This paper explores that surprise.

What the paper says:

The perspective of the paper is to see random forests (and related models) as _smoothers_, which is a kind of model that essentially memorizes the training data and then makes predictions by combining training output values that are relevant to the prediction-time (new) input values. For example, k-nearest neighbors is a simple kind of smoother. A single decision tree counts as a smoother because each final/leaf node in the tree predicts a value based on combining training outputs that could possibly reach that node. The same can be said for forests.

So the authors see a random forest as a way to use a subset of training data and a subset of (or set of weights on) training features, to provide an averaged output. While a single decision tree can overfit (become "spikey") because some leaf nodes can be based on single training examples, a forest gives a smoother prediction function since it is averaging across many trees, and often other trees won't be spikey for the same input (their leaf node may be based on many training points, not a single one).
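
That smoother view can be made concrete: the forest's prediction at a query point is a weighted average of the training outputs, where each training point gets weight according to how often it shares a leaf with the query across trees. A sketch (assuming sklearn; bootstrap is turned off here so the reconstruction is exact, with bagging it would only be approximate):

    # Recover a random forest's prediction as an explicit weighted average
    # ("smoother weights") over the training outputs.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
    forest = RandomForestRegressor(n_estimators=100, max_features=2,
                                   bootstrap=False, random_state=0).fit(X, y)

    x_new = X[:1] + 0.1                         # some query point
    train_leaves = forest.apply(X)              # (n_train, n_trees) leaf ids
    query_leaves = forest.apply(x_new)          # (1, n_trees)

    # In each tree, the training points sharing the query's leaf split its weight
    # equally; the forest weight averages the per-tree weights.
    weights = np.zeros(len(X))
    for t in range(forest.n_estimators):
        in_leaf = train_leaves[:, t] == query_leaves[0, t]
        weights[in_leaf] += 1.0 / in_leaf.sum()
    weights /= forest.n_estimators

    print("weights sum to:        ", weights.sum())          # ~1.0
    print("weighted average of y: ", float(weights @ y))
    print("forest.predict:        ", float(forest.predict(x_new)[0]))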

Finally, the authors refer to random forests as _adaptive smoothers_ to point out that random forests become even better at smoothing in locations in the input space that either have high variation (intuitively, that have a higher slope), or that are far from the training data. The word "adaptive" indicates that the predicted function changes behavior based on the nature of the data — eg, with k-NN, an adaptive version might increase the value of k at some places in the input space.

The way random forests act adaptively is that (a) the prediction function is naturally more dense (can change value more quickly) in areas of high variability because those locations will have more leaf nodes, and (b) the prediction function is typically a combination of a wider variety of possible values when the input is far from the training data because in that case the trees are likely to provide a variety of output values. These are both ways to avoid overfitting to training data and to generalize better to new inputs.

Disclaimer: I did not carefully read the paper; this is my quick understanding.


I think this is specifically coming to terms with an insight that's taught to statisticians about a bias-variance tradeoff.

From my understanding, in a statistical setting, low bias tends to come with high variance, and low variance with high bias. The example I saw was with K-means, where K = N leads to high variance (the predicted cluster is highly variable) but low bias (take an input point and you get that exact point back), vs. K = 1, which has low variance (there's only one cluster) but bad bias (the input point is far from the cluster center/representative point).

I'm not sure I've characterized it well but there's a Twitter post from Alicia Curth that explains it [0] as well as a paper that goes into it [1].

[0] https://x.com/AliciaCurth/status/1841817856142348529

[1] https://arxiv.org/abs/2409.18842


Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers https://journals.plos.org/plosone/article?id=10.1371/journal...

Found the GitHub repository code https://github.com/AzazHassankhan/Machine-Learning-based-Tra...

I made some changes to lines 9 to 70: I used yfinance instead of alpaca. Replace all the code up to line 70 with the code below.

    import plotly.offline as pox
    import plotly.graph_objs as go
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib.pyplot import figure
    import seaborn as sns
    import talib as tl
    import talib as ta   # same library under both aliases, in case later code uses either
    import yfinance as yf
    from datetime import datetime
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from statsmodels.tsa.stattools import adfuller
    # import alpaca_trade_api as tradeapi
    # from alpaca_trade_api import TimeFrame, TimeFrameUnit

    symb = "TSLA"
    start = datetime(2021, 10, 18, 9, 30, 0)   # not used by the yfinance call below
    end = datetime(2021, 10, 18, 10, 30, 0)

    df = yf.download(symb, period="1mo", interval="15m")
    next = df.copy()    # note: shadows the built-in next()
    next.tail()

    # Lower-case column aliases to match the rest of the original script
    df['close'] = df['Close']
    df['high'] = df['High']
    df['low'] = df['Low']
    df['open'] = df['Open']
    df['volume'] = df['Volume']


XGBoost, LightGBM, CatBoost, and JLBoost?


Quantized random sampling regression


Cool, more randomness at work


Reading the abstract alone, I have no idea whether it's talking about algorithmic trees or, like, the big brown things with small green bits.


You have to know some machine learning fundamentals to figure that out - "Random Forest" is a specific machine learning algorithm, which does not need further explanation. To take it a step further, they really shouldn't have to describe "machine learning" either; no, it's not like the machine takes a book and learns, it's a term.


I had the exact same reaction: biology or computers?

The only hint I can see anywhere on the page is "Statistics > Machine Learning" above the abstract title.

I really want it to be about actual biological trees being studied on the scale of forests growing with smooth edges over long periods of time, but I suspect that's not what it is about.


There's also "Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)"

Also, the very first sentence of the actual paper (after the abstract) is

> Random forests (Breiman, 2001) have emerged as one of the most reliable off-the-shelf supervised learning algorithms [...]

arxiv.org is overwhelmingly used for math and computer science papers, though not exclusively.

The paper will also likely be submitted to a machine learning venue.


Biological trees don’t make predictions. The second or third sentence contains the phrase “randomized tree ensembles not only make predictions.”


Even single cells are able to sense and adapt to their environment. That is, to recognize and react.


The tree is an incredibly common data structure in computer science. Decision trees are well known. Random forests are ubiquitous in Machine Learning. Should the authors really have to dumb their paper down so people who don’t work in this domain avoid confusing it with work in arborism?


Pretty sure the guy you’re replying to was half-joking, but adding the words ‘machine learning’ in the first sentence would have cleared this up pretty simply and wouldn’t have resulted in dumbing down anything.


> avoid confusing it with work in arborism

funny!


It's talking about these[0]

[0] https://en.wikipedia.org/wiki/Random_forest


Jeremy Howard has a great video explaining Random Forests: https://youtu.be/blyXCk4sgEg .





