mccourt's comments | Hacker News

This is a cool post, and I really appreciate the link/references to other content. At one point I played around with parametrizing procedural generation of mazes to produce a desired length/complexity balance (https://sigopt.com/blog/building-better-multicriteria-bayesi...). Would love to try that out on a more complicated situation like the one you're working on. Thanks for the content.


Glad you enjoyed it! Investigating how tunable parameters impact the generation is an interesting idea; I'll definitely try it in my own application.


You can really do training in just hours? I didn't realize how efficient current systems are for this.


This blog post lists a bunch of gradient-free optimization packages, some genetic and some Bayesian: https://sigopt.com/blog/comparison-bayesian-packages-hyperpa.... Nothing from the mathematical programming community is in here, so there are definitely other options too (depending on what kind of problems you are trying to address).

Personally, I like Nevergrad (https://facebookresearch.github.io/nevergrad/) a lot for general purpose optimization problems -- I think it is very well implemented and has a variety of tools available. I also think the documentation is appropriately honest about what is and is not known for how these algorithms work in different circumstances.

If you want something Bayesian (sample-efficient) which is very lightweight, I like PySOT (https://pysot.readthedocs.io/en/latest/). Part of why I like it is that it's written by friends of mine, but I also legitimately like its performance across a decent set of problems.

If you want something Bayesian which has corporate support (so that you know it's updated/maintained), I would recommend Botorch/Ax from Facebook (https://botorch.org/docs/botorch_and_ax). They have done a lot of research for it (a recent preprint is here https://arxiv.org/pdf/2006.05078.pdf) and have put together a very solid implementation including considerations for running online optimization problems. I think the documentation is a bit weak, but the software and research is outstanding.

Another corporate-supported option is Optuna (https://optuna.org/) from Preferred Networks. I also know some of the people working on this and I think it is a good implementation of the kernel density estimation strategy for statistical modeling -- preferring lower computational cost and consistent timing over performance. I had difficulties running it in server mode while I was testing, but if you're running locally that will not be a problem.

As is always the case with optimization strategies, there is no one answer. Different tools perform well in different circumstances. There can be bad tools, but, likely, there will never be a best tool (in my estimation).
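To make the comparison concrete: every package above wraps some version of the same suggest/evaluate/update loop, differing mainly in how the "suggest" step works. Here is a minimal sketch of that loop using plain random search (the weakest gradient-free baseline) on a made-up toy objective -- not any particular library's API:

```python
import random

def objective(x, y):
    """Made-up black-box function to minimize; no gradients are used."""
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

random.seed(0)
best_params, best_value = None, float("inf")
for _ in range(500):
    # "suggest" step: plain uniform sampling over the search box.
    # Genetic/Bayesian packages replace this step with something smarter.
    params = (random.uniform(-5, 5), random.uniform(-5, 5))
    value = objective(*params)          # "evaluate" step
    if value < best_value:              # "update" step
        best_params, best_value = params, value

print(best_params, best_value)
```

The sample-efficient (Bayesian) tools earn their keep when each call to `objective` is expensive, because they spend computation on choosing the next point instead of sampling blindly.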


Thank you! Lots of resources to explore!


Great example of the fact that different circumstances require different approaches, and the fact that brute force is increasingly impractical for combinatorially complex problems. Thanks for this reference.


A very reasonable point and, certainly, the direction that parts of the computational community have embraced over the years. I will use integration as an example: classic computational methods were focused on trying to make strong assumptions about the integrand and significantly reduce the number of integrand evaluations (Gauss quadrature is the main thing that comes to mind). As computation became more accessible/parallelizable, and problems became less analytic, Monte Carlo methods have become more fundamental.
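A tiny illustration of that trade-off on the toy integral of x^2 over [0, 1] (true value 1/3; my own example, not from any reference): two-point Gauss-Legendre quadrature is exact here precisely because of its strong polynomial assumptions, while Monte Carlo gets close with no smoothness assumptions but many more evaluations:

```python
import math
import random

f = lambda x: x * x  # integrand; the true integral over [0, 1] is 1/3

# Gauss-Legendre with 2 nodes, mapped from [-1, 1] to [0, 1]:
# exact for polynomials up to degree 3 -- strong assumptions, 2 evaluations.
nodes = [0.5 - 1 / (2 * math.sqrt(3)), 0.5 + 1 / (2 * math.sqrt(3))]
gauss = 0.5 * (f(nodes[0]) + f(nodes[1]))

# Monte Carlo: no assumptions, many cheap evaluations,
# error shrinking like 1/sqrt(n) regardless of dimension.
random.seed(0)
n = 100_000
mc = sum(f(random.random()) for _ in range(n)) / n

print(gauss, mc)  # both close to 1/3
```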

In some distributed computational settings, memory traffic is actually the main bottleneck and redundant computations are executed to reduce the need to send data (a similar situation to the one you aptly describe).

I think that, in the case of hyperparameter/meta-learning optimization (or search, depending on how you think about it), we are at a point where the complexity of models which can effectively be put into production is a function of our ability to, at least partially, analyze the space of possible modeling decisions. Will we escape that and have models whose training cost is less significant than the cost of executing an "intelligent" hyperparameter search process? Maybe ... I am a GP person, so I see potential in clever analysis of circumstances so that RKHS methods (for instance) can be leveraged to simplify the training process. But the current trajectory of the community has been toward increasingly expensive models, which keeps the ability to use them effectively with limited tuning/search cost relevant.


Yes. Upon further reflection, this competition actually looks like another step in that direction (see this thread: https://hackernews.hn/item?id=23784239).

Otherwise I agree, Gaussian Processes are nice and friendly, and work quite well for low-dimensional search (e.g., from a few to hundreds of hyperparameters) under very natural, general assumptions :-)
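For anyone unfamiliar with why GPs are "nice and friendly" here: the posterior mean is just a kernel-weighted combination of the observed values. A minimal sketch with a squared-exponential kernel and two made-up observations (toy numbers, inverting the 2x2 kernel matrix by hand, no library):

```python
import math

def k(a, b, lengthscale=1.0):
    """Squared-exponential (RBF) kernel -- the 'natural, general'
    smoothness assumption behind most GP-based search."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

# Two made-up observations of a black-box function.
X = [0.0, 1.0]
y = [0.0, 1.0]
noise = 1e-6  # small jitter for numerical stability

# Posterior mean at x*: k(x*, X) @ K^{-1} @ y, with K inverted directly.
K = [[k(X[0], X[0]) + noise, k(X[0], X[1])],
     [k(X[1], X[0]), k(X[1], X[1]) + noise]]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
Kinv = [[K[1][1] / det, -K[0][1] / det],
        [-K[1][0] / det, K[0][0] / det]]
alpha = [Kinv[0][0] * y[0] + Kinv[0][1] * y[1],
         Kinv[1][0] * y[0] + Kinv[1][1] * y[1]]

def posterior_mean(x_star):
    return k(x_star, X[0]) * alpha[0] + k(x_star, X[1]) * alpha[1]

print(posterior_mean(0.5))  # smoothly interpolates the two observations
```

The mean interpolates the data and reverts toward the prior away from it, and the same kernel machinery gives you a posterior variance for free, which is what acquisition functions exploit.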


I always appreciate articles emphasizing the importance of hyperparameter optimization; thank you for writing this. The discussion of the learning rate is a nice additional point to mention, though I find it a bit misleading -- earlier in the discussion you mention a number of hyperparameters, but then the learning rate is studied in a vacuum. If other hyperparameters were varied along with the learning rate, I assume those graphics would look much more complicated.
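A toy illustration of that interaction effect (an entirely made-up loss surface, just to show the mechanism): when the loss couples the learning rate to another hyperparameter, the "best" learning rate found in isolation shifts as soon as the other knob moves:

```python
def loss(lr, weight_decay):
    """Made-up loss with an interaction term: the best learning rate
    depends on weight_decay, so tuning lr in a vacuum is misleading."""
    return (lr - 0.1 * (1.0 + 10.0 * weight_decay)) ** 2

lrs = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3]

best_lr_at = {}
for wd in (0.0, 0.1):
    best_lr_at[wd] = min(lrs, key=lambda lr: loss(lr, wd))

print(best_lr_at)  # the winning lr doubles when weight_decay changes
```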

Additionally, practical circumstances for hyperparameter tuning using Bayesian optimization often include complications: dealing with discrete hyperparameters, large parameter spaces being unreasonably costly or poorly modeled, accounting for uncertainty in your metric, balancing competing metrics, black-box constraints. Obviously, one cannot mention everything in a blog post, I just wanted to bring up that outstanding researchers in Bayesian optimization are pushing forward on all of these topics.

Regardless, thank you for continuing to hammer home the value of hyperparameter optimization. If I may, a couple links, for anyone trying to learn more:

My favorite BO intro - https://arxiv.org/abs/1807.02811
AutoML from the Freiburg crew - http://papers.nips.cc/paper/5872-efficient-and-robust-automa...
Some discussion on parallelism/high dimensions - https://bayesopt.github.io/papers/2017/3.pdf
Strategies for warm starting - https://ml.informatik.uni-freiburg.de/papers/18-AUTOML-RGPE....


This is a good article, but I wish it went a bit further into the complexities facing Hong Kong as it seeks its place in a 21st century where access to Chinese investments is no longer throttled by access to Hong Kong. I first lived in Hong Kong as a student in 2006, and I remember thinking that the tech push starting in Shenzhen would be dwarfed by tech growth in HK, given its already prominent status in the international community and its density of top-notch universities.

When I visited in January, though, it felt like the opposite had happened; Tencent has grown into an international behemoth, and HK has never really chosen to invest in tech growth. I have friends at SUSTC, so I'm glad that Shenzhen is making big things happen, but it is disappointing to see HK still running/growing its economy according to a 1980s playbook. Even Cyberport ended up feeling more like a ploy to build more housing than an actual tech community.

Still time to make big things happen in HK, but, in my opinion, it's going to take leaders in the business community to push for it and make investments. It will also take strong collaboration with the outstanding academic community in HK to make that happen.

I'd love to hear from people currently in HK about how they view the tech sector and what their plans are.


Bayesian optimization, I am familiar with, but Q-learning, not so much. If anyone has good references on or introductions to Q-learning I would appreciate it.


I'd also like to throw in some work by a former colleague of mine at Argonne, Sven Leyffer, on nonlinear programming:

- A compendium he co-edited, named (appropriately enough) Mixed Integer Nonlinear Programming
- A review paper he co-authored for Acta Numerica: http://www.mcs.anl.gov/papers/P3060-1112.pdf

Also, yeah, the "Alexander Schrijver - Theory of Linear and Integer Programming" reference is solid.


The survey article by Sven Leyffer is a good paper, but comes from a completely different direction:

It comes from people in convex optimization trying to additionally apply some integrality conditions (a little bit as a second-class citizen). Classical combinatorial optimization, on the other hand, treats integrality conditions as a first-class citizen. I, coming from (M)ILP, would argue that the MINLP people coming from convex optimization tend to sidestep all the problems that make ILP so hard (and interesting). The MINLP people, in turn, would equally vocally argue that the (M)ILP people tend to prefer "academic" problems and don't grasp how many important research questions they miss.

It's up to the reader to decide which side is right. :-)

My personal opinion in this "flamewar" is that if you come from a computer science background (in particular theoretical computer science) you will probably prefer classic MILP culture. On the other hand if you come from engineering you will probably prefer MINLP theory as outlined in Sven Leyffer's survey paper.


Yeah, I think that's probably the split - folks from computer science/discrete math on one side and folks from engineering on the other. I grew up in math, but I was on the numerical analysis side so I definitely ended up on the MINLP side, which is why that's what I generally reference. There is certainly something elegant about ILP problems which gets lost when treating them with the sledgehammer that is gradient-based convex optimization.


> There is certainly something elegant about ILP problems which gets lost when treating them with the sledgehammer that is gradient-based convex optimization.

It is funny that you call gradient-based convex optimization a "sledgehammer", since people working in combinatorial optimization (as opposed to ILP) tend to refer to ILP methods (e.g. cutting plane algorithms, branch & bound, branch & cut, relaxation hierarchies, ...) as a "sledgehammer" too. :-D They are just jealous. :-)


That strategy can be viable; it's discussed in the Wikipedia article: https://en.wikipedia.org/wiki/Multi-objective_optimization#N...

As is suggested there, though, implementing this no-preference strategy requires some clean rescaling of the component functions in order to yield equal significance for all of them. If you have such a rescaling, that's outstanding; however, as I suggested in the section of the article dealing with the impact of the choice of currency, rescaling a problem may be a difficult proposition. This is especially true for problems that aren't as simple as the toy problem I've proposed here.
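To make the rescaling point concrete, here is a toy sketch (hypothetical candidates, objectives, and scales -- not from the article): with equal-weight scalarization, normalizing each objective onto [0, 1] over the candidate set keeps the larger-scale objective from silently dominating the sum:

```python
# Hypothetical candidates with two objectives on different scales:
# cost in dollars and latency in milliseconds (both to be minimized).
candidates = {
    "A": (100.0, 900.0),
    "B": (500.0, 200.0),
    "C": (300.0, 400.0),
}

costs = [c for c, _ in candidates.values()]
lats = [l for _, l in candidates.values()]

def rescale(v, lo, hi):
    """Map a raw objective value onto [0, 1] over the candidate set."""
    return (v - lo) / (hi - lo)

def scalarized(name):
    # Equal-weight sum AFTER rescaling; summing the raw values instead
    # would let latency's larger numbers dominate the trade-off.
    cost, lat = candidates[name]
    return (rescale(cost, min(costs), max(costs))
            + rescale(lat, min(lats), max(lats)))

best = min(candidates, key=scalarized)
print(best)  # the balanced candidate wins under equal weighting
```

Of course, min/max normalization over a finite candidate set is itself a modeling choice; with a continuous design space you would need the kind of principled rescaling discussed above, which is exactly where the difficulty lies.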

