Hacker News .hnnew | past | comments | ask | show | jobs | submitlogin

The number of parameters to be fit or estimated in a statistical or machine learning model will always be a function of the dimensionality. For example, fitting a Gaussian (a Bell Curve) with a fixed variance of 1.0 to an n-dimensional space requires estimating the mean, which is n parameters. If we want to estimate an nXn covariance matrix for our n-dimensional Gaussian, we need to estimate n+n^2 parameters. For linear or logistic regression, we'll need to estimate n+1 parameters, and so on.

The curse of dimensionality is that if you add dimensions but don't add data, your functions will overfit the data because you don't have sufficient samples to estimate your model parameters. The worst-case estimate is that for n dimensions, you will need on the order of 2^n samples. This comes from the combinatoric increase of relative "distances" as dimensions increase.

This 2^n bound assumes that your data has a high degree of uncorrelated variance across all dimensions. In practice, the curse of dimensionality often isn't a problem. This is because most high-dimensionality data residing in an n-dimensional space actually doesn't have uniform variance across all n dimensions, and can be mapped or otherwise transformed into a k-dimensional subspace where k is much smaller than n with a minimal loss of variance.

Dimensionality reduction approaches include principle component analysis (PCA), minimum message length methods (MML), various feature selection approaches, virtually every clustering algorithm. Anything that removes dimensions while retaining the essential information content will do the trick.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: