
Note that one property described is not true of all distributions. I've used the exponential distribution to predict completion times before, and it has the simple property that the expected remaining time is always equal to the mean completion time at time zero, no matter how much time has already passed. I have no idea if this is accurate or not, just that I've found the simplicity convenient. In contrast, for the power law distribution used in the blog post, the mean remaining time increases as time passes. (In both cases, the expected total time from start to finish increases as time passes.)
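A quick simulation can illustrate the difference. This is only a sketch under stated assumptions: the mean of 10, the shape alpha = 2.5, and the helper names are made up for illustration, and the power law is taken to be a Pareto with scale 1, which may not match the blog post's exact parametrization.

```python
import random

def empirical_mean_remaining(samples, elapsed):
    """Average remaining time (T - elapsed) over tasks still unfinished at `elapsed`."""
    remaining = [t - elapsed for t in samples if t > elapsed]
    return sum(remaining) / len(remaining)

def simulate(n=200_000, mean=10.0, alpha=2.5, seed=0):
    """Draw completion times from an exponential and a Pareto (scale 1) distribution.

    `mean` and `alpha` are illustrative values, not taken from the thread.
    """
    rng = random.Random(seed)
    exp_samples = [rng.expovariate(1.0 / mean) for _ in range(n)]
    par_samples = [rng.paretovariate(alpha) for _ in range(n)]
    return exp_samples, par_samples

if __name__ == "__main__":
    exp_samples, par_samples = simulate()
    for elapsed in (1.0, 2.0, 4.0, 8.0):
        e = empirical_mean_remaining(exp_samples, elapsed)
        p = empirical_mean_remaining(par_samples, elapsed)
        print(f"elapsed={elapsed:4.1f}  exponential remaining={e:6.2f}  pareto remaining={p:6.2f}")
```

With these assumptions the exponential column should stay roughly flat while the Pareto column grows with elapsed time.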

I chose the exponential distribution because it's the maximum entropy distribution for a positive number with known mean: https://en.wikipedia.org/wiki/Maximum_entropy_probability_di...



So in my limited understanding, the exponential distribution is popular because it's very easy to work with -- you can actually analytically solve some queueing problems, for example, that wouldn't be possible with other distributions.

But power-law distributions show up over and over in things that people here care about: file size, network traffic, process lifetimes, etc etc. In these cases the exponential will drastically underestimate the fat tail.
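To get a feel for how large the gap can be, here is a small sketch comparing tail probabilities of an exponential and a Pareto (Type I) distribution with the same mean. The shape alpha = 1.5 and the function names are assumptions chosen for illustration, not values fitted to any of the data sets mentioned.

```python
from math import exp

def exponential_tail(x, mean):
    """P(T > x) for an exponential distribution with the given mean."""
    return exp(-x / mean)

def pareto_tail(x, mean, alpha):
    """P(T > x) for a Pareto Type I distribution with shape alpha > 1.

    The scale t_min is chosen so the distribution has the requested mean:
    mean = alpha * t_min / (alpha - 1).
    """
    t_min = mean * (alpha - 1) / alpha
    return 1.0 if x <= t_min else (t_min / x) ** alpha

if __name__ == "__main__":
    mean, alpha = 1.0, 1.5  # illustrative values only
    for multiple in (1, 5, 20, 100):
        x = multiple * mean
        print(f"x = {multiple:3d} * mean  exponential: {exponential_tail(x, mean):.3e}"
              f"  pareto: {pareto_tail(x, mean, alpha):.3e}")
```

Near the mean the two are comparable, but far out in the tail the exponential probability collapses much faster than the Pareto one.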


A power law distribution is roughly as easy to work with here as exponential. The blog post contains the power law results for this case, which are fairly easily obtained through a conditional average (conditioning on t > t_0).
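For reference, the conditional average can be sketched as follows. This assumes a Pareto (Type I) density with shape \alpha > 1 and scale t_m, and t_0 \ge t_m; these symbols are assumptions made here, and the blog post's parametrization may differ.

```latex
\begin{align*}
f(t) &= \frac{\alpha\, t_m^{\alpha}}{t^{\alpha+1}}, \qquad t \ge t_m, \\
\mathbb{E}[T \mid T > t_0]
  &= \frac{\int_{t_0}^{\infty} t\, f(t)\, dt}{\int_{t_0}^{\infty} f(t)\, dt}
   = \frac{\alpha\, t_m^{\alpha}\, t_0^{1-\alpha} / (\alpha - 1)}{t_m^{\alpha}\, t_0^{-\alpha}}
   = \frac{\alpha}{\alpha - 1}\, t_0, \\
\mathbb{E}[T - t_0 \mid T > t_0] &= \frac{t_0}{\alpha - 1}.
\end{align*}
```

The same conditioning applied to an exponential with rate \lambda gives E[T - t_0 | T > t_0] = 1/\lambda, independent of t_0.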

The important question you bring up is which is more accurate, which I don't have an answer for. But perhaps a reader has data on this. I will note that I compared exponential against some FOIA request processing data a while back and thought it was okay, though I don't remember anything quantitative; see here: https://hackernews.hn/item?id=21032750

I think it's likely that something with more parameters like a log-normal distribution would be better than either, but intuitively I doubt you'd be able to get simple equations for the mean remaining time out of that.
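For what it's worth, the log-normal mean remaining time can at least be written in terms of the standard normal CDF, even if it is not as simple as the exponential or power law results. The sketch below uses the standard partial-expectation formula for a log-normal; the function name and the parameters mu and sigma (mean and standard deviation of the log of the completion time) are assumptions for illustration.

```python
from math import exp, log
from statistics import NormalDist

def lognormal_mean_remaining(mu, sigma, elapsed):
    """E[T - elapsed | T > elapsed] for log(T) ~ Normal(mu, sigma).

    Uses E[T; T > k] = exp(mu + sigma^2 / 2) * Phi((mu + sigma^2 - ln k) / sigma)
    and  P(T > k)    = Phi((mu - ln k) / sigma).
    Very large `elapsed` values can underflow the survival probability.
    """
    mean = exp(mu + sigma ** 2 / 2)
    if elapsed <= 0:
        return mean  # nothing has elapsed yet, so this is just the plain mean
    phi = NormalDist().cdf
    k = log(elapsed)
    partial = mean * phi((mu + sigma ** 2 - k) / sigma)
    survival = phi((mu - k) / sigma)
    return partial / survival - elapsed

if __name__ == "__main__":
    mu, sigma = 0.0, 1.0  # illustrative values only
    for elapsed in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(f"elapsed={elapsed:4.1f}  mean remaining={lognormal_mean_remaining(mu, sigma, elapsed):.3f}")
```

Whether this counts as "simple" is a judgment call; it is easy to evaluate numerically, but fitting mu and sigma takes more data than fitting a single mean.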

One problem with the power law model is that the expected duration at t = 0 is 0. The exponential model does not have that problem. You could fix this for a power law by not having power law behavior for short times.
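One possible way to do that, offered only as a sketch and not as what the blog post does, is to swap in a Lomax (Pareto Type II) distribution, which has a power law tail but flattens out at short times. Its survival function is (1 + t/scale)^(-alpha), and the mean remaining time is (scale + elapsed) / (alpha - 1) for alpha > 1. The function and parameter names below are assumptions for illustration.

```python
import random

def lomax_mean_remaining(alpha, scale, elapsed):
    """E[T - elapsed | T > elapsed] for a Lomax distribution with shape alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is infinite unless alpha > 1")
    return (scale + elapsed) / (alpha - 1)

def sample_lomax(alpha, scale, n, seed=0):
    """Draw n Lomax samples by inverting the CDF 1 - (1 + t/scale)^(-alpha)."""
    rng = random.Random(seed)
    return [scale * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0) for _ in range(n)]

if __name__ == "__main__":
    alpha, scale = 3.0, 4.0  # illustrative values only
    samples = sample_lomax(alpha, scale, 200_000)
    print("empirical mean at t = 0:", sum(samples) / len(samples))
    for elapsed in (0.0, 2.0, 6.0):
        print(f"elapsed={elapsed:4.1f}  mean remaining={lomax_mean_remaining(alpha, scale, elapsed):.2f}")
```

With this choice the expected duration at t = 0 is scale / (alpha - 1) rather than 0, and the mean remaining time still grows linearly with elapsed time, as in the pure power law.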


I learned about the prevalence of Pareto distributions in computer systems from Harchol-Balter's Performance Modeling and Design of Computer Systems[0] (which, I will admit, I properly understood perhaps 25% of).

She reported the Pareto distribution of process times from first data collected in 1997[1].

[0] https://www.cs.cmu.edu/~harchol/PerformanceModeling/book.htm...

[1] https://www.cs.cmu.edu/~harchol/Papers/TOCS.pdf


A reference to Jaynes. :) Have an upvote!



