As a student of Information Geometry, let me provide some context on why this is such an exciting field. Geometry usually reminds people of triangles and other shapes we see in our immediate surroundings; that is Euclidean, or flat, geometry, i.e. a space where Euclid's axioms hold. But there is a lot more to the story: we can bend, reformulate, or even drop certain axioms to conjure up new spaces, such as hyperbolic geometry or, more generally, arbitrarily curved geometries. It turns out these have wide applications in the real world.
Now what has all this got to do with information? Information is usually represented in terms of statistical distributions, following Shannon's information theory. What the early founders of IG observed is that these distributions can be treated as points on a curved space called a Statistical Manifold. Many of the quantities used in information theory can then be reinterpreted geometrically.
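To make that concrete, here is the standard construction (as I understand it, not tied to any particular text): for a parametric family of distributions $p_\theta$, the Fisher information defines the Riemannian metric on the statistical manifold, and the KL divergence between nearby distributions is, to second order, the squared length that metric assigns to the displacement $d\theta$:

$$
g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\partial_{\theta_i}\log p_\theta(x)\,\partial_{\theta_j}\log p_\theta(x)\right],
\qquad
D_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_{\theta + d\theta}\right) \approx \tfrac{1}{2}\, d\theta^{\top} g(\theta)\, d\theta .
$$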
So, why is it so exciting? Well, in Deep Learning people predominantly work with statistical distributions, some without even realising it. Our optimizations amount to reducing the discrepancy between statistical distributions, such as the distribution of the data and the distribution that the neural network is trying to model. It turns out that, when this optimization is done in ordinary parameter space, it amounts to the gradient descent we all know and love. Gradient-based optimisations only approximate the local geometry: the gradient gives the local slope, the Hessian a local quadratic approximation of the curvature of the loss surface. Optimisation carried out on the statistical manifold instead uses the exact metric of the space of distributions, the Fisher information, and can therefore be more efficient per step. This method is called the Natural Gradient.
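For a concrete, toy illustration, here is a minimal sketch of natural gradient descent fitting a 1-D Gaussian to data. The parameterisation (mu, log sigma), the learning rate, and the closed-form Fisher matrix are my own choices for the example, not anything canonical; the point is only that the update preconditions the ordinary gradient by the inverse Fisher matrix, θ ← θ − η F(θ)⁻¹ ∇L(θ).

```python
# A minimal sketch of natural gradient descent: fit a 1-D Gaussian N(mu, sigma^2)
# to data by minimising the average negative log-likelihood.
# Parameters are (mu, log_sigma); in these coordinates the Fisher information
# matrix of the Gaussian is diag(1/sigma^2, 2).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)   # samples from the "true" distribution

def nll_grad(mu, log_sigma, x):
    """Gradient of the average negative log-likelihood w.r.t. (mu, log_sigma)."""
    sigma2 = np.exp(2 * log_sigma)
    g_mu = -(x.mean() - mu) / sigma2
    g_ls = 1.0 - np.mean((x - mu) ** 2) / sigma2
    return np.array([g_mu, g_ls])

def fisher(mu, log_sigma):
    """Exact Fisher information matrix of N(mu, sigma^2) in (mu, log_sigma) coordinates."""
    sigma2 = np.exp(2 * log_sigma)
    return np.diag([1.0 / sigma2, 2.0])

theta = np.array([0.0, 0.0])                # start at N(0, 1)
for step in range(200):
    g = nll_grad(theta[0], theta[1], data)
    F = fisher(theta[0], theta[1])
    theta -= 0.1 * np.linalg.solve(F, g)    # natural gradient step: theta -= lr * F^{-1} g

print("estimate:", theta[0], np.exp(theta[1]))   # converges to roughly (3.0, 2.0)
```

In this toy case the Fisher matrix is available in closed form; for neural networks it has to be approximated (e.g. with block-diagonal or Kronecker-factored schemes), which is exactly the efficiency question I mention below.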
That's indeed mathematically very exciting and reason enough to study IG.
But does IG allow us to reason about Neural Nets in new ways that could move the needle on open questions about information representation in ANNs or, even better, BNNs?
Good question. Most people are focusing on the natural gradient and on making it as efficient as SGD. But some have been exploring whether we can use IG to introduce inductive biases in function space rather than in weight space. It is still quite a new field, though.
So it's the geometry of the set of probability measures on a pre-measure space? If so, that sounds like something that might be interesting for point processes as well.
Hope this helps.