Research

Confounding emerges whenever data spans multiple populations, environments, or laboratories – an unavoidable setting in large-scale datasets. Searching this data for new signals can awaken previously innocuous confounding effects, a risk further exacerbated by the unprecedented power of ML. Because these phenomena are novel, domain intuition rarely flags them, making them silent killers of scientific rigor. My work centers on evaluating when data fusion is safe, deconfounding results when it is not, and the paradoxes that arise when aggregating conclusions from non-fused data sources.

Topics

Confounder Identification
All unsupervised techniques rely on distributional assumptions in order to recover components or clusters. Discrete data poses an interesting challenge: categorical distributions have no inherent parametric form, so the usual parametric assumptions are unavailable. An alternative approach, which I have pioneered, uses causal structure to separate unconfounded components in the data. This is a natural fit, since confounding is an inherently causal problem. The perspective also expands the notion of causal identifiability: many relationships that are unidentifiable from the graph alone become identifiable.
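To make the problem concrete, here is a minimal synthetic sketch (not the algorithm from any of the papers below): a hidden class variable confounds two categorical variables, so they appear dependent when the data is pooled even though they are independent within each class. The variable names and distributions are invented purely for illustration.

```python
# Illustrative sketch: a hidden class Z confounds two categorical variables
# X and Y. Within each class, X and Y are independent; pooling over Z
# induces a spurious association.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 20_000

# Latent class (e.g., an unrecorded lab or sub-population), never observed.
z = rng.integers(0, 2, size=n)

# X and Y are drawn from class-specific categorical distributions,
# but are conditionally independent given Z.
p_x = np.array([[0.7, 0.2, 0.1],   # P(X | Z=0)
                [0.1, 0.2, 0.7]])  # P(X | Z=1)
p_y = np.array([[0.8, 0.2],        # P(Y | Z=0)
                [0.3, 0.7]])       # P(Y | Z=1)
x = np.array([rng.choice(3, p=p_x[zi]) for zi in z])
y = np.array([rng.choice(2, p=p_y[zi]) for zi in z])

def independence_p_value(a, b):
    """Chi-squared test of independence from a contingency table."""
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    return chi2_contingency(table)[1]

print("pooled X vs Y, p =", independence_p_value(x, y))                  # tiny: spurious dependence
print("X vs Y | Z=0, p =", independence_p_value(x[z == 0], y[z == 0]))   # large: independent
print("X vs Y | Z=1, p =", independence_p_value(x[z == 1], y[z == 1]))   # large: independent
```

The challenge my work addresses is recovering structure like this when Z is never observed and no parametric form can be imposed on the categorical components.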
Decision Fusion
It is not unusual for studies and experts to disagree with each other. Such disagreement is often driven by differing contexts rather than by incorrect deductions or bad data. One way to investigate a contradiction is to merge the underlying data and build a larger picture; unfortunately, many private medical settings deny direct access to data. Through decision fusion, I seek to understand when results from different settings are truly in conflict and what these disagreements can reveal about the underlying system.
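As a toy illustration of how context drives apparent contradiction, the sketch below uses classic Simpson's-paradox-style counts (textbook-style numbers, not data from my work): each study favors the treatment on its own terms, yet naively pooling the raw counts reverses the conclusion.

```python
# Two studies run in different contexts (different case mixes).
# Each arm is recorded as (successes, trials).
studies = {
    "study A (mild cases)":   {"treated": (81, 87),   "control": (234, 270)},
    "study B (severe cases)": {"treated": (192, 263), "control": (55, 80)},
}

def rate(success_trials):
    s, n = success_trials
    return s / n

for name, arms in studies.items():
    # Each study, taken on its own terms, favors the treatment.
    print(f"{name}: treated {rate(arms['treated']):.2f} "
          f"vs control {rate(arms['control']):.2f}")

# Naive fusion: add up the raw counts and compare pooled success rates.
pooled = {
    arm: tuple(sum(study[arm][i] for study in studies.values()) for i in (0, 1))
    for arm in ("treated", "control")
}
print(f"pooled: treated {rate(pooled['treated']):.2f} "
      f"vs control {rate(pooled['control']):.2f}")
# The pooled comparison flips in favor of the control: the studies treated
# different case mixes, so the disagreement reflects context, not error.
```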
Distribution Shift and Counterfactual Features
Features often contain a mixture of “good” and “bad” information. From a fairness standpoint, SAT scores carry information about both inherent academic ability and access to tutoring resources. From a domain adaptation standpoint, some of a feature’s information has a stable, reliable relationship with the prediction label, while other relationships break down under distribution shift. My work uses insights from causal inference to construct data representations that separate the different components of information hidden in these ambiguous features.
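The toy sketch below (a stand-in illustration, not the construction from the UAI paper) shows the phenomenon: an observed feature mixes a stable component with an environment-specific one, and a crude representation that subtracts the unstable part via a hypothetical proxy generalizes under shift, while the raw feature does not.

```python
# Toy sketch: a feature mixing a stable component s (whose relationship with
# the label survives a shift) and an unstable component u (whose relationship
# is environment-specific). All names and distributions are invented.
import numpy as np

rng = np.random.default_rng(0)

def make_env(n, u_label_corr):
    s = rng.normal(size=n)                      # stable cause of the label
    y = 2.0 * s + 0.5 * rng.normal(size=n)      # label
    u = u_label_corr * y + rng.normal(size=n)   # unstable, environment-specific
    x_mixed = s + u                             # what we actually observe
    proxy = u + 0.1 * rng.normal(size=n)        # auxiliary measurement of u
    return x_mixed, proxy, y

def fit(features, y):
    X = np.column_stack([features, np.ones(len(y))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(w, features, y):
    X = np.column_stack([features, np.ones(len(y))])
    return np.mean((X @ w - y) ** 2)

# Training environment: u is strongly (and spuriously) tied to the label.
x_tr, p_tr, y_tr = make_env(5_000, u_label_corr=1.0)
# Deployment environment: that tie disappears.
x_te, p_te, y_te = make_env(5_000, u_label_corr=0.0)

w_raw = fit(x_tr, y_tr)               # uses the mixed feature as-is
w_split = fit(x_tr - p_tr, y_tr)      # crude "split": remove the unstable part

print("raw feature -- train MSE %.2f, shifted MSE %.2f"
      % (mse(w_raw, x_tr, y_tr), mse(w_raw, x_te, y_te)))
print("stable part -- train MSE %.2f, shifted MSE %.2f"
      % (mse(w_split, x_tr - p_tr, y_tr), mse(w_split, x_te - p_te, y_te)))
```

Here the subtraction of a proxy is just a crude stand-in; the research question is how to engineer such representations in a principled, causally grounded way.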
Time-Dependent Genomic Signatures for Cancer Classification and Prediction
We process non-coding regions of the genome that contain duplication and mutation signatures. These mutation profiles have been shown to be predictive of various forms of cancer.
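Purely as an illustrative sketch of the pipeline shape, and assuming each sample is summarized by a vector of mutation-signature counts with a binary label (synthetic stand-in data and placeholder model choices, not our actual pipeline):

```python
# Hypothetical sketch: mutation-signature count vectors -> cancer/non-cancer label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_signatures = 200, 96          # 96 = standard substitution contexts

# Synthetic stand-in data: per-sample signature counts and binary labels.
counts = rng.poisson(lam=5.0, size=(n_samples, n_signatures))
labels = rng.integers(0, 2, size=n_samples)

# Normalize counts to per-sample proportions so total mutation burden
# does not dominate, then fit a regularized linear classifier.
profiles = counts / counts.sum(axis=1, keepdims=True)
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, profiles, labels, cv=5).mean())
```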

Recent Publications

(2023). Discrete Nonparametric Causal Discovery under Latent Class Confounding.


(2023). Omitted Labels in Causality: A Study of Paradoxes.


(2023). Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity.


(2023). Causal Information Splitting: Engineering Proxy Features for Robustness to Distribution Shifts. In UAI 2023.


(2023). Causal Inference Despite Limited Global Confounding via Mixture Models. In CLeaR 2023.
