Causal Representation Learning for Latent Space Optimization

Abstract

In this thesis, we study causal representation learning for latent space optimization, which allows for robust and efficient generation of novel synthetic data with maximal target value. We assume that the observed data was generated by a few latent factors, some of which are causally related to the target and others of which are spuriously correlated with the target and confounded by an environment variable. Our proposed method consists of three steps that exploit the structure of the causal graph describing the assumed underlying data-generating process. In the first step, we recover the true data representation (i.e., the latent factors from which the observed data originated). We develop novel identifiability theory, showing that the true data representation can be recovered up to simple transformations by a generalized version of identifiable variational auto-encoders. In the second step, we identify the causal latent factors of the target, for which we propose a practical causal inference scheme that employs (conditional) independence tests and causal discovery algorithms. Our method does not require access to the true environment variable, which overcomes a major limitation of existing causal representation learning approaches. In the final step, we query latent points that correspond to data points with high target values by intervening upon the causal latent factors using standard latent space optimization techniques. We empirically evaluate and thoroughly analyze our method on three different tasks, including a chemical design task. We show that our method can successfully recover the true data representation in the finite data regime and correctly identify the causal latent factors of the target, which results in state-of-the-art performance for black-box optimization.
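
To illustrate the idea behind the second step, the following is a minimal sketch rather than the thesis's actual procedure. It assumes a toy linear-Gaussian causal graph in which an unobserved environment variable drives both latent factors and only one latent factor drives the target. It uses partial correlation as a simple stand-in for a conditional independence test. All names (env, z_causal, z_spurious, target, partial_corr) are hypothetical and chosen for illustration only.

```python
import numpy as np


def partial_corr(x, y, cond):
    """Correlation between x and y after linearly regressing out cond."""
    design = np.column_stack([np.ones(len(x)), cond])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


rng = np.random.default_rng(0)
n = 5000

# Toy data-generating process (an assumption, not the thesis's model):
# env -> z_causal -> target, and env -> z_spurious.
env = rng.normal(size=n)                # environment, never used by the test
z_causal = env + rng.normal(size=n)     # latent factor that drives the target
z_spurious = env + rng.normal(size=n)   # related to the target only through env
target = 2.0 * z_causal + 0.5 * rng.normal(size=n)

# Marginally, the spurious factor looks predictive of the target.
marginal = float(np.corrcoef(z_spurious, target)[0, 1])

# Conditioning on the causal factor removes that dependence,
# while the causal factor stays dependent given the spurious one.
spurious_given_causal = partial_corr(z_spurious, target, z_causal)
causal_given_spurious = partial_corr(z_causal, target, z_spurious)

print(f"corr(z_spurious, target)             = {marginal:.3f}")
print(f"corr(z_spurious, target | z_causal)  = {spurious_given_causal:.3f}")
print(f"corr(z_causal, target | z_spurious)  = {causal_given_spurious:.3f}")
```

In this toy graph the test separates the two factors without observing the environment variable. Partial correlation only detects linear dependence, so a nonlinear setting would call for a more general conditional independence test.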

Publication
MPhil Thesis, University of Cambridge
Wenlin Chen
PhD Student in Machine Learning