The Little Handbook of Statistical Practice (2012)

mikorym · on Oct 29, 2019

I've often told people in the passing that "if you have 20 parameters and p=0.05 you should expect with random data to have something register as significant". Looks like OP has beaten me to it to illustrate this. [1]

[1] http://www.jerrydallal.com/LHSP/coffee.htm

Nokinside · on Oct 29, 2019

If you have 20 parameters, you should probably compare the effect sizes first. If there are no outliers, p=0.05 or p=0.001 can't make the result important.

https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...

> no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern (2006 Gelman, A., and Stern, H. (2006), “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant,” The American Statistician, 60, 328–331. DOI: 10.1198/000313006X152649. ) famously observed, the difference between “significant” and “not significant” is not itself statistically significant.

brookhaven_dude · on Oct 29, 2019

What do you say about neural networks with thousands of parameters?

atlasair · on Oct 30, 2019

Neural networks are just used for prediction, not inference.

sevensor · on Oct 29, 2019

This comes at a good time! I was just talking with my spouse about why you put n-1 in the denominator when you calculate variance. I always liked "because you used up a degree of freedom calculating the mean," but I feel like that's kind of hand-wavy. (Obviously, I'm not a statistician.)

clircle · on Oct 29, 2019

You can argue either way. dividing by n is the maximum likelihood estimator, but dividing by n-1 is the unbiased estimator. Depends on what you like. Most people prefer unbiased estimators when they are available.

sanderjd · on Oct 29, 2019

What does it mean for it to be "unbiased"? What does it mean to "use up" a degree of freedom?

I don't mind if things just can't really be explained intuitively because they are fundamentally technical, but your explanation and the parent's both do this thing where it sounds like it's explaining things in plain common language, but isn't actually because it isn't clear what those plain words mean in this context.

absherwin · on Oct 29, 2019

Unbiased means that if I draw infinitely many random samples from a population and average a statistic (in this case standard deviation) across all the samples, the answer will be the statistic computed from the population itself. If one divides by n instead of n-1, the estimate for standard deviation will be be (n-1)/n too small. One reading this might think, "Wait! We're going to infinity so the ratio converges to 1." That's true if the size of each sample also goes to infinity but not if we draw millions of ten item samples.

As for using up a degree of freedom, the easiest way to build intuition for why this is a useful concept is to think about very small samples. Let's say I draw a sample of 1 item. By definition the item is equal to the mean so I receive no information about the standard deviation. Conversely, if someone had told me the mean in advance, I could learn a bit about the standard deviation with a single sample. This carries on beyond one in diminishing amounts. Imagine I draw two items. There's some probability that they're both on the same side of the mean, in that case, I'll estimate my sample mean as being between those number and underestimate the standard deviation. Note that I'd still underestimate it even with the bias correction, it's just that that factor compensates just enough that it balances out over all cases.

A simple, concrete way to convince yourself that this is real is to consider the standard deviation of a variable that has an equal probability of being 1 or 0. The standard deviation is 0.5. But if we randomly sample two items, 50% of the time they'll be the same and we'll estimate the standard deviation as zero. The other 50% of the time, we'll get the right answer. Hence, our average is half the right answer (n/(n-1)=2/1). The correction makes the standard deviation double what it should be half the same while remaining zero in the other cases. This also suggests why dividing by n is referred to as a the maximum likelihood estimator.

krychu · on Oct 29, 2019

This is very helpful, especially the example at the end, thanks. I think the difficult part to understand is that dividing by n leads to an estimate that is somehow too small. The intuition tells you that dividing by n would just give you the true average.

sanderjd · on Oct 29, 2019

Thanks! This is helpful!

dataflow · on Oct 29, 2019

"Unbiased" means the error is on average zero.

benrbray · on Oct 29, 2019

A mathematician once gave me a convincing argument for dividing by n+1 instead, but I have since forgotten his reason.

mturmon · on Oct 29, 2019

The argument is that this factor can minimize the mean squared error of the variance estimate, at the cost of some bias. In general, the small correction to the lead factor depends on the fourth moment. For a Gaussian, you get 1/(n+1).

This is an example of a “shrinkage estimator”, which comes up a lot - introduce some bias but get a smaller MSE. For more, see: https://en.wikipedia.org/wiki/Bessel%27s_correction

adouzzy · on Oct 29, 2019

Asymptotically, all the same. If sample size is small, it's a bad estimator anyway.

pknerd · on Oct 29, 2019

You guys might like to watch this video:

https://www.youtube.com/watch?v=dq_D30kyR1A

https://www.youtube.com/watch?v=ANsVodOu1Tg

cjhveal · on Oct 29, 2019

For those wondering, this is what the resource at hand says about the matter:

"To be precise, when this mean is calculated, the sum of the squared deviations is divided by one less than the sample size rather than the sample size itself. There's no reason why it must be done this way, but this is the modern convention. It's not important that this seem the most natural measure of spread. It's the way it's done. You can just accept it (which I recommend) or you'll have to study the mathematics behind it. But, that's another course."

mikorym · on Oct 29, 2019

Could you link to which of the pages in OP's book (webpage) this is?

I tried in vain to go to the likely one and still couldn't find where the variance gets introduced!

jfim · on Oct 29, 2019

Just as a warning, while the book mentions multiple comparisons, it's possible to read the book in such a way that one would skip that section.

The section called "A Valuable Lesson" does show that doing multiple tests with the same threshold of P<0.05 does cause inexistant effects to be reported as statistically significant, but the section on correcting for that is present much later in the section about ANOVA.

That's actually a pretty severe flaw, especially for a handbook that is likely to be read partially.

talson · on Oct 29, 2019

I just finished up a class that covered this material. Is there a good next step or book for someone interested in data science.

caliguliminix · on Oct 29, 2019

Not exactly a textbook, but good to read nonetheless is, "Statistics Done Wrong"[1]. It is a short read but is filled with all the most common ways in which statistical analysis is abused with concrete examples.

[1] Statistics Done Wrong: The Woefully Complete Guide https://www.amazon.com/dp/1593276206/ref=cm_sw_r_cp_apa_i_u....

subroutine · on Oct 29, 2019

What software / lang are you currently using to analyze data?

talson · on Oct 29, 2019

It used excel and Python. I’ve also personally used R outside of the class.

subroutine · on Oct 29, 2019

That's good you got exposure to Python (and, I assume, numpy/scipy/pandas etc.), and you're already familiar with R. Are you majoring in data science, and just looking for something extra?

If you're interested in machine learning, Andrew Ng Coursera course is almost a right of passage at this point - it's very accessible: https://www.coursera.org/learn/machine-learning

Kutner is the bible on regression models (tho, not a super fun read): https://www.amazon.com/dp/0073014664

This was one of my favorites as an undergrad: https://www.amazon.com/gp/product/0805833889

talson · on Oct 29, 2019

I’ll check these out, thanks for the recommendations. I’m not majoring in Data science, but I do find it really interesting and want to learn more.

alexhutcheson · on Oct 29, 2019

R for Data Science is great: https://r4ds.had.co.nz/

chefschef · on Oct 29, 2019

I love tiny, reference-able resources like this!

wazoox · on Oct 29, 2019

The anouncement on top of the page[0] makes me sad. I don't want to use a Kindle, neither files in this format. Why no ePub option?

[0]:http://www.jerrydallal.com/LHSP/kdp.htm

jackallis · on Oct 29, 2019

beginner questions: how is information inhere better than others?

AdrienLemaire · on Oct 29, 2019

Just sharing for information: written in 2001 and last modified in 2012. It's great that some resources are not aging.

ptrott2017 · on Oct 29, 2019

updates after 2012 - are in the ebook version for Kindle (read the important announcement section to see how to obtain it)

vondur · on Oct 29, 2019

Well, statistics hasn’t changed much has it?

everybodyknows · on Oct 29, 2019

At the level of application to scientific practice, yes, much has changed. For a start:

https://news.ycombinator.com/item?id=16859052 https://news.ycombinator.com/item?id=18577079

tomrod · on Oct 29, 2019

You have to understand the base to understand the crisis.

Rainymood · on Oct 29, 2019

This is nothing new. People have always whined about the overgeneralised conclusions from studies with low N counts and the inability to reproduce them.

AdrienLemaire · on Oct 29, 2019

This was an informal comment without judgment. On the contrary, it's great to see the Lindy effect in action.

ngcc_hk · on Oct 29, 2019

Seems ok site. Yes. Is there a major development since for basic ?