Stupid Data Miner Tricks: Overfitting the S&P 500 (nerdsonwallstreet.typepad.com)
70 points by herrherr on March 25, 2011 | 17 comments


Does interpolation ever work in forecasting?

My gut instinct would be that markets and human systems are chaotic in nature. Even in the most chaotic systems, if you look at a suitably small sample, you can see some correlations and patterns between different factors which really don't exist. These are mirage correlations.

Take the Lorenz attractor as an example. At some points, it will cycle on the same "wing" of the butterfly many times. But betting that it will do it again is a really lousy bet.
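To make the "wing" point concrete, here is a rough sketch that integrates the Lorenz system and records how many time steps the trajectory stays on one wing before switching. The parameters (sigma=10, rho=28, beta=8/3), the RK4 step size, the starting point, and the use of the sign of x to label the wing are all assumptions picked for illustration, not anything from the article.

```python
# Illustrative sketch: how long does a Lorenz trajectory stay on one "wing"?
# Parameters, step size, and starting point are assumptions chosen for the demo.

def lorenz_deriv(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(state, dt):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz_deriv(state)
    k2 = lorenz_deriv(shifted(state, k1, dt / 2))
    k3 = lorenz_deriv(shifted(state, k2, dt / 2))
    k4 = lorenz_deriv(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def wing_run_lengths(steps=5000, dt=0.01, start=(1.0, 1.0, 1.0)):
    """Return the lengths (in time steps) of consecutive stays on one wing.

    The wing is identified by the sign of x, a rough but common convention.
    """
    state = start
    runs = []
    current_sign = state[0] > 0
    length = 0
    for _ in range(steps):
        state = rk4_step(state, dt)
        sign = state[0] > 0
        if sign == current_sign:
            length += 1
        else:
            # The trajectory crossed to the other wing: close the current run.
            runs.append(length)
            current_sign = sign
            length = 1
    runs.append(length)  # the final, possibly unfinished, stay
    return runs

if __name__ == "__main__":
    runs = wing_run_lengths()
    print("number of wing switches:", len(runs) - 1)
    print("stay lengths (steps):", runs)
```

The stay lengths come out irregular, which is the point: a long stay on one wing says little about how long the next one will be.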

Polynomial approximation and curve fitting in general work when we're trying to explicate relationships between variables in a problem space in which we understand the causal linkages very well (and they're constant). In that setting, such as engineering, it can be really useful.



Those that want Google PDF viewer links can get them added automatically wherever they go. This greasemonkey script (of several available I'm sure) adds a Google viewer icon after all pdf links: http://homepages.inf.ed.ac.uk/imurray2/code/user_scripts/goo...


Terrible generalization of polynomials is useful for demonstrating overfitting (I've done it myself in tutorials). However, responsible tutorials should mention that the other obvious lesson is that the polynomials (1, x, x², x³, etc) are a terrible set of basis functions for regression. Don't just watch for overfitting, but use a sensible regression model! For complicated fits some methods to consider are: local regression, splines, various artificial neural nets, or Gaussian processes.
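A small sketch of the point about polynomials as a basis, using the textbook Runge function 1/(1 + 25x²) as an assumed test case (it is not from the article). It compares a degree-10 polynomial forced through 11 equally spaced samples with a piecewise-linear fit (the simplest spline) through the same samples. This is exact interpolation rather than regression, so treat it as the extreme end of overfitting.

```python
# Illustrative sketch: one global polynomial vs. a piecewise-linear spline.
# The test function and node count are assumptions chosen for the demo.

def runge(x):
    """Runge's function, a standard example where polynomials misbehave."""
    return 1.0 / (1.0 + 25.0 * x * x)

def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial through (xs, ys) at x (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def piecewise_linear_eval(xs, ys, x):
    """Evaluate the piecewise-linear interpolant through (xs, ys) at x."""
    for k in range(len(xs) - 1):
        if xs[k] <= x <= xs[k + 1]:
            t = (x - xs[k]) / (xs[k + 1] - xs[k])
            return (1 - t) * ys[k] + t * ys[k + 1]
    raise ValueError("x outside the sampled range")

def max_errors(n_nodes=11, n_test=1001):
    """Return (polynomial max error, piecewise-linear max error) on [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n_nodes - 1) for i in range(n_nodes)]
    ys = [runge(x) for x in xs]
    grid = [-1.0 + 2.0 * i / (n_test - 1) for i in range(n_test)]
    poly_err = max(abs(lagrange_eval(xs, ys, x) - runge(x)) for x in grid)
    lin_err = max(abs(piecewise_linear_eval(xs, ys, x) - runge(x)) for x in grid)
    return poly_err, lin_err

if __name__ == "__main__":
    poly_err, lin_err = max_errors()
    print("degree-10 polynomial max error:", poly_err)
    print("piecewise-linear max error:    ", lin_err)
```

Both fits pass through every sample, but the polynomial oscillates wildly near the ends of the interval while the local fit stays close to the function everywhere.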


So what is a good basis for polynomial regression? I have heard this statement a few times, but I never heard of a good alternative.


I tried not to be that guy and already gave some alternatives for regression.

“Polynomial regression” implies to me that the basis functions are polynomials. I’ll assume you meant “good basis for a simple fit, maybe by least squares”. More local functions like “radial basis functions” can work well. Or use splines or sigmoidal functions, which saturate to a flat line or linear trend. In some applications Fourier or wavelet bases might be appropriate.
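As a rough illustration of why local functions behave better away from the data, here is a sketch of a Nadaraya-Watson kernel smoother with Gaussian (radial) weights. Note that this is a swapped-in stand-in: it is a form of local regression, not a least-squares radial basis function fit, and the bandwidth, the sine-wave data, and the function name `kernel_smooth` are assumptions for the demo.

```python
import math

# Illustrative sketch: a Gaussian-weighted local average (Nadaraya-Watson).
# Bandwidth and sample data are assumptions chosen for the demo.

def kernel_smooth(xs, ys, x, bandwidth=0.5):
    """Predict y at x as a Gaussian-weighted average of the observed ys."""
    d2 = [(x - xi) ** 2 for xi in xs]
    # Subtract the smallest squared distance so the nearest point always has
    # weight 1; this avoids every weight underflowing to zero far from the data.
    d2_min = min(d2)
    weights = [math.exp(-(d - d2_min) / (2.0 * bandwidth ** 2)) for d in d2]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

if __name__ == "__main__":
    xs = [0.5 * i for i in range(13)]   # samples on [0, 6]
    ys = [math.sin(x) for x in xs]
    for x in (3.0, 6.0, 10.0, 100.0):
        print(x, kernel_smooth(xs, ys, x))
```

Because every prediction is a weighted average of observed values, it can never leave the range of the data; far outside the samples it flattens out to the nearest observation, whereas a high-degree polynomial would shoot off to infinity.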

Gaussian process regression is a Bayesian treatment of some basis function models, potentially with an infinite number of basis functions. Artificial neural nets usually use local or sigmoidal basis functions, potentially in a more complicated way.


What's with the [scribd] tag when direct linking to a .pdf file? It's becoming common, but I can't understand it.


When you submit a link to a PDF you implicitly submit it to scribd who then purloin the content and make it available through their interface.

They claim to honor take-down requests, but I've been caught out by this. Before I knew of their distasteful practices I submitted a link, they took a copy, and now they won't take it down because I'm not the copyright holder. The person who is the copyright holder doesn't know, and I've been unable to get in touch, despite trying.

I'm pretty sure they'd be unhappy about it if they knew.

So now I never submit a link to a PDF; I always create a place-holder and submit that, or find some other reference. The existence of these auto-links to scribd is one aspect of HN that makes me feel genuinely grubby, and you should only ever submit a PDF if you're happy to have it copied without permission.

https://hackernews.hn/item?id=836544


It's just a convenience link for viewing the PDF on scribd rather than downloading / using a plugin.


How is that convenience? Scribd doesn't work (for me) at all because I don't have Flash installed. On other machines, it takes forever to load a page with a whole pile of static Flash embeds, each of which is a sand-trap for my scroll wheel.


But it's a direct link to a pdf, not to the scribd page.


There is a separate link inside the [scribd] tag.


"If the NFL wins, the market goes up, otherwise, it takes a dive. What’s happened over the last thirty years? Well, most of the time, the NFL wins the Superbowl"

Standards of editing have really gone down over the years. The "NFL" always wins the Superbowl...


> The "NFL" always wins the Superbowl...

It does now, but it didn't always. Super Bowls III and IV were won by AFL teams (Jets & Chiefs) before the AFL/NFL merger in 1970. For several years after the merger, it was quite common (though technically incorrect) to call the newly created NFC & AFC "conferences" by their pre-merger acronyms.


TLDR: Correlation != causation; if you have high-dimensional data, you can always find a correlation, but it's probably meaningless; polynomial wiggle is a bitch, so don't fit high-degree polynomials to your data.
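The "you can always find a correlation" part is easy to check with pure noise. The sketch below draws one random target series and many unrelated random candidate series, then reports the strongest correlation found. The sizes (20 observations, 1000 candidates) and the seed are arbitrary assumptions.

```python
import random

# Illustrative sketch: searching many noise series for the "best" predictor.
# Sample sizes and seed are assumptions chosen for the demo.

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def best_spurious_correlation(n_obs=20, n_features=1000, seed=0):
    """Return the largest |correlation| between a noise target and noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, feature)))
    return best

if __name__ == "__main__":
    print("strongest correlation found in pure noise:",
          best_spurious_correlation())
```

None of the candidates has any relationship to the target, yet with this many to choose from the winner looks impressive. That is the same trap as picking the best-looking indicator for the S&P 500 after the fact.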


Related question: if I, despite the warnings, fancy my chances at this sort of thing, what sort of historical data can I get? Is free [machine-readable] stock market data easy to come by, or impossible?


I think you can get daily summaries quite easily, eg:

http://www.google.com/finance/historical?q=NASDAQ:GOOG

and from memory when the markets are open you can see the order book stuff:

http://finance.yahoo.com/q/ecn?s=GOOG+Order+Book

But I believe you have to buy the more detailed data from the exchange itself. I have no idea how you do this but as far as I know it costs maybe a few thousand per month.

Some more stuff:

http://www.statslab.cam.ac.uk/~chris/links.html



