Distribution Shift and Transportability

The principal assumption when building any (not necessarily causal) prediction model is access to relevant data for the task at hand. When predicting a label $Y$ from inputs $\mathbf{X}$, this assumption says that the training data are drawn from a joint distribution over $(\mathbf{X}, Y)$ identical to the distribution that will generate the model's use cases (the target distribution).
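Written out (with $P_{\text{train}}$ and $P_{\text{target}}$ denoting the two joint distributions, notation introduced here only for illustration), the assumption is

$$
P_{\text{train}}(\mathbf{X}, Y) \;=\; P_{\text{target}}(\mathbf{X}, Y),
$$

and a distribution shift is any violation of it, for instance $P_{\text{train}}(\mathbf{X}) \neq P_{\text{target}}(\mathbf{X})$ with $P(Y \mid \mathbf{X})$ unchanged (covariate shift), or a change in $P(Y \mid \mathbf{X})$ itself.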

Unfortunately, the dynamic nature of real-world systems makes obtaining perfectly relevant data difficult. Data-gathering mechanisms can introduce sampling bias, yielding distorted training data. Even in the absence of sampling bias, changing populations, environments, and interventions give rise to distribution shifts in their own right.

This issue is fundamentally related to causality: causal relationships are more likely to hold up in new environments than correlational ones. In the past, I have worked on new methods for reweighting data for domain adaptation. Currently, I am interested in developing frameworks for understanding distribution shift from a causal perspective, especially in the presence of unobserved concepts (which can span both causes and effects).
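To make the reweighting idea concrete, here is a minimal, generic density-ratio sketch (not a description of my own methods; the function name is hypothetical): a domain classifier is trained to distinguish source from target inputs, and each source example is weighted by the estimated ratio $p_{\text{target}}(\mathbf{x}) / p_{\text{train}}(\mathbf{x})$.

```python
# Generic covariate-shift reweighting sketch via a domain classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_source, X_target):
    """Estimate p_target(x) / p_source(x) from a source-vs-target classifier."""
    X = np.vstack([X_source, X_target])
    # Domain labels: 0 = source, 1 = target.
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = clf.predict_proba(X_source)[:, 1]
    # Convert classifier probabilities to a density ratio,
    # correcting for unequal source/target sample sizes.
    return (p_target / (1 - p_target)) * (len(X_source) / len(X_target))

# The weights would then enter a standard weighted fit on labeled source data, e.g.
# model.fit(X_source, y_source, sample_weight=importance_weights(X_source, X_target)).
```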