09:15 – 09:30 Nihat Ay: Welcome Address
-> Video
09:30 – 10:30 Emtiyaz Khan: The Bayesian Learning Rule for Adaptive AI
Humans and animals have a natural ability to autonomously learn and quickly adapt to their surroundings. How can we design AI systems that do the same? In this talk, I will present Bayesian principles to bridge such gaps between humans and AI. I will show that a wide variety of machine-learning algorithms are instances of a single learning rule called the Bayesian learning rule. The rule unravels a dual perspective yielding new adaptive mechanisms for machine-learning-based AI systems. My hope is to convince the audience that Bayesian principles are indispensable for an AI that learns as efficiently as we do.
-> Video
-> Slides
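As a minimal illustration of the flavour of the rule described above (an editorial sketch, not code from the talk): for a Gaussian approximation q = N(m, 1/s), a flat prior, and a quadratic loss with exact expectations, the Bayesian learning rule reduces to a variational-online-Newton-style update on the natural parameters of q. The loss parameters h and t below are arbitrary toy values.

```python
import numpy as np

# Sketch of the Bayesian learning rule for q = N(m, 1/s) and the quadratic loss
# l(theta) = 0.5 * h * (theta - t)^2, assuming a flat prior and exact expectations.
h, t = 4.0, 2.0          # curvature and minimiser of the toy loss
m, s = 0.0, 1.0          # mean and precision of q
rho = 0.1                # learning rate

for _ in range(300):
    g = h * (m - t)      # E_q[dl/dtheta], exact here; Monte Carlo in general
    H = h                # E_q[d2l/dtheta2]
    s = (1 - rho) * s + rho * H    # precision (natural-parameter) update
    m = m - rho * g / s            # mean update, preconditioned by the precision

print(m, s)              # approaches the loss minimiser t = 2 and curvature h = 4
```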
10:30 – 10:45 break
10:45 – 11:15 Csongor Huba Varady, Luigi Malago, Riccardo Volpi, Nihat Ay: Natural Reweighted Wake-Sleep
-> Video
-> Slides
11:15 – 11:45 Masanari Kimura, Hideitsu Hino: Information Geometry of Dropout Training
-> Video
11:45 – 12:15 Geoffrey Wolfer, Shun Watanabe: Information Geometry of Reversible Markov Chains
-> Video
-> Slides
12:15 – 13:30 break
13:30 – 14:30 Guido Montúfar: Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs
We consider the problem of optimizing the expected long term reward in a Partially Observable Markov Decision Process over the set of memoryless stochastic policies. In this talk I will discuss the properties of the objective function, in particular the existence of policy improvement cones and optimizers in low-dimensional subsets of the search space. Then I will discuss how the problem can be formulated as the optimization of a linear function over a constrained set of state-action frequencies and present descriptions of the parametrization and the constraints, which allows us to estimate the number of critical points and formulate optimization strategies in state-action space. The talk is based on works with Johannes Rauh and Nihat Ay and recent works with Johannes Müller.
-> Video
-> Slides
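For readers unfamiliar with the state-action-frequency view mentioned in the abstract, the sketch below (an editorial illustration with random toy data, not code from the talk) solves the fully observable special case, where maximising the expected discounted reward is exactly a linear program over the polytope of discounted state-action frequencies; partial observability adds the polynomial constraints discussed in the talk.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities (toy)
r = rng.random((nS, nA))                        # reward r(s, a) (toy)
rho = np.full(nS, 1.0 / nS)                     # initial state distribution

# Variables: mu(s, a) >= 0, the normalised discounted state-action frequencies.
# Flow constraints: sum_a mu(s',a) - gamma * sum_{s,a} P(s'|s,a) mu(s,a) = (1-gamma) rho(s').
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for a in range(nA):
        A_eq[sp, sp * nA + a] += 1.0
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] -= gamma * P[s, a, sp]
b_eq = (1 - gamma) * rho

# Maximise <r, mu>  <=>  minimise <-r, mu>; mu >= 0 are linprog's default bounds.
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=b_eq, method="highs")
mu = res.x.reshape(nS, nA)
policy = mu / mu.sum(axis=1, keepdims=True)     # optimal memoryless policy pi(a|s)
print("expected discounted return:", r.flatten() @ res.x / (1 - gamma))
```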
14:30 – 14:45 break
14:45 – 15:15 Rob Brekelmans, Frank Nielsen: Rho-Tau Bregman Information and the Geometry of Annealing Paths
-> Video
-> Slides
15:15 – 15:45 Alessandro Bravetti, Maria L. Daza-Torres, Hugo Flores-Arguedas, Michael Betancourt: Bregman dynamics, contact transformations and convex optimization
15:45 – 16:15 Wu Lin: Structured second-order methods via natural-gradient descent
-> Video
-> Slides (slightly different from the slides shown in the video)
16:15 – 16:30 break
16:30 – 17:30 James Martens: Rapid training of deep neural networks without normalization, RELUs, or skip connections
Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the “shape” of the network’s initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class. In our experiments we show that DKS enables SGD training of residual networks without normalization layers on Imagenet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization performance. And when using K-FAC as the optimizer, we achieve similar results for networks without skip connections. Our results apply for a large variety of activation functions, including those which traditionally perform very badly, such as the logistic sigmoid. In addition to DKS, we contribute a detailed analysis of skip connections, normalization layers, special activation functions like RELU and SELU, and various initialization schemes, explaining their effectiveness as alternative (and ultimately incomplete) ways of “shaping” the network’s initialization-time kernel.
-> Video
-> Slides
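As background on the Q/C maps mentioned above (an editorial sketch of the Poole et al. signal-propagation analysis, not the DKS code itself): the C map of an activation describes how the correlation between two inputs evolves through a wide, randomly initialised layer, and DKS works by transforming activations so that this map stays close to the identity. The Monte Carlo estimate below compares ReLU with the logistic sigmoid; the sample size and q value are arbitrary demo choices.

```python
import numpy as np

def c_map(phi, c, q=1.0, n=200_000, seed=0):
    # Monte Carlo estimate of the layer-wise correlation (C) map for activation phi:
    # inputs with variance q and correlation c are mapped to correlation c' after
    # the activation, under the wide-network Gaussian signal-propagation approximation.
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)
    z2 = c * z1 + np.sqrt(1 - c**2) * rng.standard_normal(n)
    u, v = np.sqrt(q) * z1, np.sqrt(q) * z2
    q_out = np.mean(phi(u) ** 2)                # Q map: output variance
    return np.mean(phi(u) * phi(v)) / q_out     # C map: output correlation

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for c in [0.0, 0.5, 0.9, 0.99]:
    print(c, c_map(relu, c), c_map(sigmoid, c))
# ReLU already pushes correlations upward; the untransformed logistic sigmoid maps even
# uncorrelated inputs to highly correlated outputs, one symptom of the kernel degeneracy
# that activation transformations of the kind used by DKS are designed to remove.
```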
09:30 – 10:30 Minh Ha Quang: Rényi divergences in RKHS and Gaussian process settings
Rényi divergences, including in particular their special case, the Kullback-Leibler divergence, play an important role in numerous problems in statistics, probability, and machine learning. In this talk, we present their regularized versions in the reproducing kernel Hilbert space (RKHS) and Gaussian process settings. These are formulated using the Alpha Log-Det divergences on the Hilbert manifold of positive definite Hilbert-Schmidt operators on a Hilbert space. We show that these infinite-dimensional divergences can be consistently estimated from finite sample data, with dimension-independent convergence rates. The theoretical formulations will be illustrated with applications in functional data analysis.
-> Video
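For orientation, the sketch below computes the finite-dimensional Alpha Log-Det divergence between two regularised Gram matrices, in the form popularised by Chebbi and Moakher; the talk's contribution is the consistent infinite-dimensional extension of such quantities to RKHS covariance operators and Gaussian processes, which this editorial toy does not implement. The RBF kernel, sample sizes, and regularisation value are arbitrary choices.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gram matrix of a Gaussian RBF kernel (an arbitrary choice for this demo).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def alpha_log_det(A, B, alpha):
    # D_alpha(A, B) = 4/(1-a^2) * log det[(1-a)/2 A + (1+a)/2 B]
    #                 - 2/(1+a) * log det A - 2/(1-a) * log det B,   alpha in (-1, 1).
    mix = 0.5 * (1 - alpha) * A + 0.5 * (1 + alpha) * B
    ld = lambda M: np.linalg.slogdet(M)[1]      # numerically stable log-determinant
    return (4.0 / (1 - alpha**2)) * ld(mix) \
           - (2.0 / (1 + alpha)) * ld(A) - (2.0 / (1 - alpha)) * ld(B)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 3)), rng.normal(loc=0.5, size=(50, 3))
gamma = 1e-2                                    # regularisation, as in the regularised setting
A = rbf_gram(X) + gamma * np.eye(50)
B = rbf_gram(Y) + gamma * np.eye(50)
print(alpha_log_det(A, B, alpha=0.5), alpha_log_det(A, A, alpha=0.5))  # the latter is 0
```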
10:30 – 10:45 break
10:45 – 11:15 Jakub Bober, Anthea Monod, Emil Saucan, Kevin N. Webster: Rewiring Networks for Graph Neural Network Training Using Discrete Geometry
-> Video
-> Slides
11:15 – 11:45 Geoffrey Wolfer, Shun Watanabe: Geometric Aspects of Data-Processing of Markov Chains
-> Video
-> Slides
11:45 – 12:15 Riccardo Volpi, Luigi Malago: Alpha-Embeddings for Natural Language Processing
-> Video
-> Slides
12:15 – 13:30 break
13:30 – 14:30 Hideitsu Hino: A Geometrical Generalization of Covariate Shift
Many machine learning methods assume that the training data and the test data follow the same distribution, but in the real world, this assumption is very often violated. In particular, the phenomenon in which the marginal distribution of the data changes is called covariate shift, and it is one of the most important research topics. We show that the well-known family of methods for covariate shift adaptation can be unified in the framework of information geometry. Furthermore, we show that parameter search for geometrically generalized methods of covariate shift adaptation can be achieved efficiently by an information criterion in a simple parametric case, or by a Bayesian optimization method in the general case. It is experimentally shown that the proposed generalization almost always achieves better performance than the existing methods it encompasses. This work was done in collaboration with Mr. Masanari Kimura.
-> Video
-> Slides
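A minimal sketch (an editorial illustration, not the talk's code) of the best-known member of the family referred to above: importance-weighted regression with exponentially flattened weights w(x)^lambda, in the spirit of Shimodaira (2000). lambda = 0 gives ordinary ERM, lambda = 1 full importance weighting, and intermediate values interpolate; the geometric generalization in the talk is about choosing such interpolation parameters efficiently. The density ratio is taken as known here for simplicity, and all data are synthetic.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    # Solve argmin_beta sum_i w_i * (y_i - x_i @ beta)^2 via a rescaled lstsq.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 300
x_tr = rng.normal(0.0, 1.0, n)                  # training inputs
x_te_mean = 2.0                                 # test inputs are shifted (covariate shift)
y_tr = np.sin(x_tr) + 0.1 * rng.normal(size=n)  # same conditional p(y|x) on both sides

# Known density ratio w(x) = p_test(x) / p_train(x) for two unit-variance Gaussians.
ratio = np.exp(x_te_mean * x_tr - 0.5 * x_te_mean**2)

X = np.stack([np.ones(n), x_tr], axis=1)        # linear model, deliberately misspecified
for lam in [0.0, 0.5, 1.0]:
    beta = weighted_least_squares(X, y_tr, ratio**lam)
    print(lam, beta)                            # the fit tilts toward the test region as lam grows
```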
14:30 – 14:45 break
14:45 – 15:15 Jesse van Oostrum, Johannes Müller, Nihat Ay: Invariance Properties of the Natural Gradient in Overparametrised Systems
-> Video
-> Slides
15:15 – 15:45 Henrique K. Miyamoto, Fábio C. C. Meneghetti, Sueli I. R. Costa: The Fisher-Rao Loss for Learning under Label Noise
-> Video
-> Slides
15:45 – 16:15 Keiji Miura, Ruriko Yoshida: Plücker Coordinates of the best-fit Stiefel Tropical Linear Space to a Mixture of Gaussian Distributions
-> Video
-> Slides
16:15 – 16:30 break
16:30 – 17:30 Leonard Wong: Conformal mirror descent with logarithmic divergences
Divergences such as Bregman and Kullback-Leibler divergences are fundamental in probability, statistics and machine learning. We begin by explaining how divergences arise naturally from the geometry of optimal transport. Then, we study a family of logarithmic costs – originally motivated by financial applications – which may be regarded as a canonical deformation of the negative dot product in Euclidean quadratic transport. It induces a logarithmic divergence which has remarkable probabilistic and geometric properties. As an application, we introduce a generalization of continuous-time mirror descent.
-> Video
-> Slides
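To make the logarithmic divergence mentioned above concrete, here is an editorial sketch of one standard form from the Pal–Wong line of work the talk builds on: for an alpha-exponentially concave function phi, L[p : q] = (1/alpha) log(1 + alpha <grad phi(q), p - q>) - (phi(p) - phi(q)), which recovers the Bregman divergence of -phi as alpha -> 0. The example uses phi = log of the geometric mean on the simplex, which is 1-exponentially concave; the dimension and test points are arbitrary.

```python
import numpy as np

def log_divergence(phi, grad_phi, p, q, alpha):
    # L^(alpha)[p : q] = (1/alpha) log(1 + alpha <grad phi(q), p - q>) - (phi(p) - phi(q)),
    # defined when the log argument is positive and phi is alpha-exponentially concave.
    inner = grad_phi(q) @ (p - q)
    return np.log1p(alpha * inner) / alpha - (phi(p) - phi(q))

# phi = log geometric mean on the probability simplex: exp(phi) is the concave
# geometric-mean function, so phi is 1-exponentially concave.
phi = lambda x: np.mean(np.log(x))
grad_phi = lambda x: 1.0 / (len(x) * x)

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
print(log_divergence(phi, grad_phi, p, q, alpha=1.0))        # nonnegative, 0 iff p = q

# As alpha -> 0, the logarithmic divergence approaches the Bregman divergence of -phi.
bregman = (phi(q) - phi(p)) + grad_phi(q) @ (p - q)
print(log_divergence(phi, grad_phi, p, q, alpha=1e-6), bregman)
```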
09:30 – 10:30 Gabriel Peyré: Optimal Transport for High dimensional Learning
Optimal transport (OT) has recently gained a lot of interest in machine learning. It is a natural tool to compare probability distributions in a geometrically faithful way. It finds applications in both supervised learning (using geometric loss functions) and unsupervised learning (to perform generative model fitting). OT is however plagued by the curse of dimensionality, since it might require a number of samples which grows exponentially with the dimension. In this talk, I will explain how to leverage entropic regularization methods to define computationally efficient loss functions, approximating OT with a better sample complexity. More information and references can be found on the website of our book “Computational Optimal Transport”: https://optimaltransport.github.io/
-> Video
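A minimal sketch of the entropic-regularization idea mentioned above (an editorial illustration; see the book linked in the abstract for complete treatments): Sinkhorn scaling iterations compute a regularised transport plan between two point clouds, and the debiased combination S_eps(a, b) = OT_eps(a, b) - (OT_eps(a, a) + OT_eps(b, b)) / 2 gives a loss better suited for training. The cloud sizes and eps are arbitrary demo values, and the transport cost of the regularised plan is used as a simple proxy for the full entropic objective.

```python
import numpy as np

def sinkhorn_cost(x, y, a, b, eps=0.5, iters=500):
    # Entropic OT between weighted point clouds (squared Euclidean ground cost),
    # computed with plain Sinkhorn scaling iterations on the Gibbs kernel K.
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]            # regularised transport plan
    return (P * C).sum()                       # transport cost under that plan

def sinkhorn_divergence(x, y, a, b, eps=0.5):
    # Debiased combination (the Sinkhorn-divergence construction); it vanishes when
    # the two clouds coincide and removes much of the entropic bias.
    return sinkhorn_cost(x, y, a, b, eps) \
        - 0.5 * (sinkhorn_cost(x, x, a, a, eps) + sinkhorn_cost(y, y, b, b, eps))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(40, 2)), rng.normal(loc=1.0, size=(50, 2))
a, b = np.full(40, 1 / 40), np.full(50, 1 / 50)
print(sinkhorn_divergence(x, y, a, b), sinkhorn_divergence(x, x, a, a))
```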
10:30 – 10:45 break
10:45 – 11:15 Emil Saucan, Vladislav Barkanass, Jürgen Jost: Coarse geometric kernels for networks embedding
-> Video
11:15 – 11:45 Emil Saucan, Vladislav Barkanass: Can we see the shape of our data?
-> Video
11:45 – 12:15 Pablo A. Morales, Jan Korbel, Fernando E. Rosas: Information-Geometric Framework from convexity
-> Video
12:15 – 13:30 break
13:30 – 14:30 Dominik Janzing: Causal Maximum Entropy Principle: inferring distributions from causal directions and vice versa
The principle of insufficient reason (PIR) assigns equal probabilities to each alternative of a random experiment whenever there is no reason to prefer one over the other. The maximum entropy principle (MaxEnt) generalizes PIR to the case where statistical information like expectations are given. It is known that both principles result in paradoxical probability updates for joint distributions of cause and effect. This is because constraints on the conditional P(effect∣cause) result in changes of P(cause) that assign higher probability to those values of the cause that offer more options for the effect, suggesting “intentional behavior.” Earlier work therefore suggested sequentially maximizing (conditional) entropy according to the causal order, but without further justification apart from plausibility on toy examples. I justify causal modifications of PIR and MaxEnt by separating constraints into restrictions for the cause and restrictions for the mechanism that generates the effect from the cause. I further sketch why causal PIR also entails “Information Geometric Causal Inference.” I will also briefly discuss problems of generalizing the causal version of MaxEnt to arbitrary causal DAGs, which are related to the non-trivial relation between directed and undirected graphical models. I will also describe our recent work on merging datasets to obtain more causal insights.
-> Video
[1] D. Janzing: Causal versions of maximum entropy and principle of insufficient reason, Journal of Causal Inference, 2021.
[2] S.-H. Garrido-Mejia, E. Kirschbaum, D. Janzing: Obtaining Causal Information by Merging Datasets with MAXENT, AISTATS 2022.
[3] D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniusis, B. Steudel, B. Schölkopf: Information-geometric approach to inferring causal directions, Artificial Intelligence 2013.
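A toy calculation of the paradox described above (an editorial illustration in the spirit of the examples in [1]): the cause X is binary, and the effect Y has one admissible value when X = 0 but three when X = 1. Joint MaxEnt then tilts P(X) toward the value of X with more options for Y, whereas the causal (sequential) version keeps P(X) uniform.

```python
# Admissible (x, y) pairs: X = 0 allows only Y = 0; X = 1 allows Y in {0, 1, 2}.
support = [(0, 0), (1, 0), (1, 1), (1, 2)]

# Joint MaxEnt subject only to the support constraint: uniform over admissible pairs.
joint = {xy: 1.0 / len(support) for xy in support}
p_x1_joint = sum(p for (x, _), p in joint.items() if x == 1)   # = 3/4

# Causal MaxEnt: first maximise the entropy of P(X) (uniform over {0, 1}),
# then the entropy of P(Y | X) within each admissible set.
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 1.0}, 1: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}}
p_x1_causal = p_x[1]                                           # = 1/2

print(p_x1_joint, p_x1_causal)   # 0.75 vs 0.5: the joint version "rewards" the cause
                                 # value that offers more options for the effect.
```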
14:30 – 14:45 break
14:45 – 15:15 M. Ashok Kumar, Kumar Vijay Mishra: Information Geometry of Relative Alpha-Entropy
-> Video
-> Slides
15:15 – 15:45 José Crispín Ruíz-Pantaleón: Areas on the space of smooth probability density functions on S^2
-> Video
-> Slides
15:45 – 16:15 Max von Renesse: Entropic Regularization and Iterative Scaling for Unbalanced Optimal Transport - A Reprise: The Sinkhole Algorithm
-> Video
-> Slides
16:15 – 16:30 break
16:30 – 17:30 Wuchen Li: Transport information Bregman divergences
We study Bregman divergences in probability density space embedded with the Wasserstein-2 metric. Several properties and dualities of transport Bregman divergences are provided. In particular, we derive the transport Kullback-Leibler (KL) divergence by a Bregman divergence of negative Boltzmann-Shannon entropy in Wasserstein-2 space. We also derive analytical formulas and generalizations of transport KL divergence for one-dimensional probability densities and Gaussian families. We also discuss some connections between Wasserstein-2 geometry and information geometry.
-> Video
-> Slides
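For readers who want the Euclidean baseline that the abstract lifts to Wasserstein-2 space: on the probability simplex, the ordinary KL divergence is exactly the Bregman divergence generated by negative Shannon entropy. The quick numerical check below is an editorial aside, not code from the talk.

```python
import numpy as np

def bregman(F, gradF, p, q):
    # B_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
    return F(p) - F(q) - gradF(q) @ (p - q)

neg_entropy = lambda p: np.sum(p * np.log(p))        # F(p) = sum_i p_i log p_i
grad_neg_entropy = lambda p: np.log(p) + 1.0

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
kl = np.sum(p * np.log(p / q))
print(bregman(neg_entropy, grad_neg_entropy, p, q), kl)   # both equal KL(p || q)
```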
09:30 – 10:30 Masafumi Oizumi: Unified framework for quantifying causal influences based on information geometry
Assessment of causal influences is a ubiquitous and important subject across diverse research fields. Whereas pairwise causal influences between elements can be easily quantified, quantifying multiple influences among many elements poses two major mathematical difficulties. First, overestimation occurs due to interdependence among influences if each influence is separately quantified in a part-based manner and then simply summed over. Second, it is difficult to isolate causal influences while avoiding noncausal confounding influences. To resolve these difficulties, we propose a theoretical framework based on information geometry for the quantification of multiple causal influences with a holistic approach. In this framework, we quantify causal influences as the divergence between the actual probability distribution of a system and a constrained probability distribution where causal influences among elements are statistically disconnected. This framework provides intuitive geometric interpretations harmonizing various information theoretic measures in a unified manner, including mutual information (predictive information), transfer entropy, stochastic interaction, and integrated information, each of which is characterized by how causal influences are disconnected. Our framework should help to analyze causal relationships in complex systems in a complete and hierarchical manner.
-> Video
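A minimal numerical illustration of the "divergence to a disconnected distribution" idea (an editorial sketch, not the talk's code): for the full cut that disconnects all influences between a past variable X and a future variable Y, the minimal KL divergence to the disconnected (product) family is exactly the mutual information, i.e. the predictive information mentioned above. The joint distribution here is random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 4))
P /= P.sum()                       # joint distribution P(X, Y) of past and future states

Px = P.sum(axis=1, keepdims=True)  # marginal of X
Py = P.sum(axis=0, keepdims=True)  # marginal of Y

# The KL projection of P onto the fully disconnected family {Q(X) R(Y)} is Px * Py,
# and the divergence to it equals the mutual information I(X; Y).
Q = Px * Py
kl_to_disconnected = np.sum(P * np.log(P / Q))
mi = np.sum(P * (np.log(P) - np.log(Px) - np.log(Py)))
print(kl_to_disconnected, mi)      # identical; partial cuts yield transfer entropy, etc.
```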
10:30 – 10:45 break
10:45 – 11:15 Carlotta Langer, Nihat Ay: Gradually Increasing the Latent Space in the em-Algorithm
-> Video
-> Slides
11:15 – 11:45 Masahito Hayashi: Algorithm for rate distortion theory based on em algorithm
-> Video
-> Slides
11:45 – 12:15 Hisatoshi Tanaka: Efficient Design of Randomised Experiments
-> Video
-> Slides
12:15 – 13:30 break
13:30 – 14:30 Kenji Fukumizu: Stability in learning of generative adversarial networks
Generative adversarial networks (GANs) learn the probability distribution of data and generate samples from the learned distribution. While GANs show a remarkable ability to generate samples of high quality, it is known that GANs often show unstable behavior during training. In this work, we develop a theoretical framework for understanding the stability of learning GAN models. We discuss the dynamics of probabilities acquired by the generator of GAN, and derive sufficient conditions that guarantee the convergence of the gradient descent learning. We show that existing GAN variants with stabilization techniques satisfy some, but not all, of these conditions. Using tools from convex analysis, optimal transport, and reproducing kernels, we construct a GAN that fulfills these conditions simultaneously.
-> Video
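For readers unfamiliar with the instability referred to above, the classic Dirac-GAN toy of Mescheder et al. (2018) is sketched below; this is an editorial illustration of the phenomenon, not the framework developed in the talk. Simultaneous gradient descent/ascent circles around the equilibrium and slowly drifts away, while a simple R1-type regulariser makes it converge; step size and steps are arbitrary demo values.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def simulate(gamma=0.0, h=0.1, steps=1000):
    # Dirac-GAN toy: data = delta_0, generator = delta_theta, linear discriminator
    # D(x) = psi * x, trained by simultaneous gradient steps on the minimax objective.
    theta, psi = 1.0, 1.0
    for _ in range(steps):
        s = sigmoid(psi * theta)
        d_theta = psi * s                    # generator step direction (descent on V)
        d_psi = -theta * s - gamma * psi     # discriminator step direction (ascent on V, R1 penalty)
        theta, psi = theta + h * d_theta, psi + h * d_psi
    return theta, psi

print("no regularisation:", simulate(gamma=0.0))   # orbits and drifts away from (0, 0)
print("R1 penalty:       ", simulate(gamma=1.0))   # converges to the equilibrium (0, 0)
```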
14:30 – 14:45 break
14:45 – 15:15 Jun Zhang: Partially-Flat Geometry and Natural Gradient Method
-> Video
15:15 – 15:45 Ionas Erb: Power Transformations of Relative Count Data as a Shrinkage Problem
-> Video
15:45 – 16:15 Uriel Legaria, Sergio Martinez, Sergio Mota, Alfredo Coba, Argenis Chable, Antonio Neme: Anomaly detection in the probability simplex under different geometries
-> Video
16:15 – 16:30 break
16:30 – 17:30 Klaus-Robert Müller: Applications of Geometrical Concepts for Learning
I will address the usage of geometrical concepts across a number of application domains. One is to add to the theoretical backbone of explainable AI by studying Diffeomorphic Counterfactuals and Generative Models. The other, time permitting, will consider the inclusion of problem-inherent (e.g. Lie-group) invariance structure to build less data-hungry models.
-> Video
-> Slides
09:30 – 10:30 Sho Sonoda: The Ridgelet Transforms of Neural Networks on Manifolds and Hilbert Spaces
To investigate how neural network parameters are organized and arranged, it is easier to study the distribution of parameters than to study the parameters in each neuron. The ridgelet transform is a pseudo-inverse operator (or an analysis operator) that maps a given function f to the parameter distribution \gamma so that the network S[\gamma] represents f. For depth-2 fully-connected networks on Euclidean space, a closed-form expression has long been known, so it can describe how the parameters are organized. However, no closed-form expression was known for a variety of today’s neural networks. Recently, our research group has found a way to systematically derive ridgelet transforms for fully-connected layers on manifolds (non-compact symmetric spaces) and for group convolution layers on abstract Hilbert spaces. In this talk, the speaker will explain a natural way to derive these ridgelet transforms.
-> Video
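To fix notation for readers (an editorial sketch, not the closed-form ridgelet transforms derived in the talk): a depth-2 network can be written as an integral S[\gamma](x) over a parameter distribution, and any finite network is a Monte Carlo discretisation of that integral. The code below implements only the forward operator S for a hand-picked, purely hypothetical \gamma; computing \gamma from a target f in closed form is what the ridgelet transform does. The Gaussian sampling measure and tanh activation are arbitrary demo choices.

```python
import numpy as np

def S(gamma, x, n_samples=50_000, seed=0):
    # Monte Carlo discretisation of the integral representation
    #   S[gamma](x) = E_{(a, b) ~ mu}[ gamma(a, b) * sigma(a * x - b) ]
    # with mu = standard Gaussian on (a, b) and sigma = tanh (demo choices).
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n_samples)
    b = rng.standard_normal(n_samples)
    return np.mean(gamma(a, b) * np.tanh(a * x[:, None] - b), axis=1)

# A hand-picked (hypothetical) parameter distribution; the ridgelet transform is the
# operator that would produce such a gamma from a target function f in closed form.
gamma = lambda a, b: a * np.exp(-0.5 * (a**2 + b**2))

x = np.linspace(-3, 3, 7)
print(S(gamma, x))      # the function realised by this parameter distribution
```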
10:30 – 10:45 break
10:45 – 11:15 Kazu Ghalamkari, Mahito Sugiyama: Non-negative low-rank approximations for multi-dimensional arrays on statistical manifold
-> Video
-> Slides
11:15 – 11:45 Hiroshi Matsuzoe: Geometry of quasi-statistical manifolds and geometric pre-divergences
-> Video
11:45 – 12:15 Domenico Felice, Nihat Ay: A canonical divergence from the perspective of data science
-> Video
12:15 – 13:30 break
13:30 – 14:30 Frank Nielsen, Jun Zhang: Questions and Answers (see tutorials)
Frank Nielsen:
Video: “Introduction to Information Geometry” by Frank Nielsen
Slides: PrintIntroductionInformationGeometry-FrankNielsen.pdf
Email: frank.nielsen.x@gmail.com
Jun Zhang:
Video: Information Geometry Tutorial (2021, BANFF-CMO)
Email: junz@umich.edu
14:30 – 14:45 break
14:45 – 15:45 Nihat Ay: Summary and concluding discussion on Information Geometry for Data Science