Research

Confounding emerges whenever data spans multiple populations, environments, or laboratories, an unavoidable setting in large-scale datasets. The search for new signals in such data can awaken previously innocuous confounding effects, which are further exacerbated by the unprecedented power of ML. The novelty of these phenomena conceals them from the intuitions of domain knowledge, making them silent killers of scientific rigor. My work centers on evaluating when data fusion is safe, deconfounding results when it is not, and understanding the paradoxes that arise when aggregating conclusions from non-fused data sources.
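
As a minimal sketch of how fusing data can create a misleading signal, the following example pools outcome counts from two sites. The counts, the site names, and the helper functions are hypothetical and chosen only for illustration; they do not come from any study. Treatment looks better than control within each site, yet worse once the sites are merged, because the sites differ in both baseline success rate and treatment allocation.

```python
from fractions import Fraction

# Hypothetical (successes, trials) counts per site and arm; illustrative only.
sites = {
    "site_1": {"treated": (81, 87), "control": (234, 270)},
    "site_2": {"treated": (192, 263), "control": (55, 80)},
}


def rate(successes, trials):
    """Exact success rate for one arm at one site."""
    return Fraction(successes, trials)


def pooled_rate(arm):
    """Success rate for one arm after naively merging all sites."""
    successes = sum(sites[name][arm][0] for name in sites)
    trials = sum(sites[name][arm][1] for name in sites)
    return Fraction(successes, trials)


# Does treatment beat control inside every individual site?
per_site_favors_treatment = all(
    rate(*arms["treated"]) > rate(*arms["control"]) for arms in sites.values()
)

# Does treatment beat control in the merged data?
pooled_favors_treatment = pooled_rate("treated") > pooled_rate("control")

print(per_site_favors_treatment)  # True
print(pooled_favors_treatment)    # False
```

Here the site acts as the confounder: conditioning on it recovers the within-site comparison, while ignoring it reverses the conclusion.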

Topics

Confounder Identification

All unsupervised techniques rely on distributional assumptions to recover components or clusters. Discrete data poses an interesting challenge: categorical distributions are inherently nonparametric, which rules out parametric assumptions. One alternative approach, which I have pioneered, is the use of causal structure to help separate unconfounded components in data. Such an approach is natural, as confounding is an inherently causal problem. This perspective expands the notion of causal identifiability, as many graphically unidentifiable relationships become identifiable.
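
To make the setting concrete, here is a small sketch of latent class confounding in discrete data, under assumed parameters. The variable names and probabilities are hypothetical. Two binary variables X and Y are independent within each value of an unobserved class U, yet they are correlated in the observed (mixed) distribution. The code only demonstrates the induced dependence; it is not an implementation of any identification algorithm.

```python
from fractions import Fraction as F

# Hypothetical latent-class model: U is unobserved, and X and Y are binary
# variables that are independent given U (a mixture of product distributions).
p_u = {0: F(1, 2), 1: F(1, 2)}
p_x1_given_u = {0: F(1, 5), 1: F(4, 5)}
p_y1_given_u = {0: F(1, 5), 1: F(4, 5)}

# Observed marginals, obtained by summing out the latent class.
p_x1 = sum(p_u[u] * p_x1_given_u[u] for u in p_u)
p_y1 = sum(p_u[u] * p_y1_given_u[u] for u in p_u)

# Observed joint probability of X=1 and Y=1, using independence within each class.
p_x1_y1 = sum(p_u[u] * p_x1_given_u[u] * p_y1_given_u[u] for u in p_u)

# A nonzero covariance means X and Y look dependent once U is hidden.
covariance = p_x1_y1 - p_x1 * p_y1
print(covariance)  # 9/100
```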

Decision Fusion

It is not unusual for studies and experts to disagree with each other. Such disagreement is often driven by differing contexts rather than incorrect deductions or bad data. One way to investigate contradiction is to use merged data to build a larger picture. Unfortunately, many private medical settings deny direct access to data. Through decision fusion, I seek to understand when results from different settings are in conflict and what these disagreements can indicate about the underlying system.

Distribution Shift and Counterfactual Features

Features often contain a mixture of “good” and “bad” information. From a fairness standpoint, SAT scores contain information about both inherent academic ability and access to tutoring resources. From a domain adaptation standpoint, some information may have stable and reliable relationships with the prediction label, while other relationships break down. My work uses insights from causal inference to determine data representations that separate the different components of information hidden in these ambiguous features.

Time-Dependent Genomic Signatures for Cancer Classification and Prediction

We process non-coding regions of the genome which contain duplication and mutation signatures. These mutation profiles have been shown to be predictive of various forms of cancer.

Recent Publications

Bijan Mazaheri, Spencer Gordon, Yuval Rabani, Leonard Schulman (2023). Discrete Nonparametric Causal Discovery under Latent Class Confounding.

Bijan Mazaheri, Siddharth Jain, Matthew Cook, Jehoshua Bruck (2023). Omitted Labels in Causality: A Study of Paradoxes.

Spencer Gordon, Eric Jahn, Bijan Mazaheri, Yuval Rabani, Leonard Schulman (2023). Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity.

Bijan Mazaheri, Atalanti Mastakouri, Dominik Janzing, Mila Hardt (2023). Causal Information Splitting: Engineering Proxy Features for Robustness to Distribution Shifts. In UAI 2023.

Spencer Gordon, Bijan Mazaheri, Yuval Rabani, Leonard Schulman (2023). Causal Inference Despite Limited Global Confounding via Mixture Models. In CLeaR 2023.
