Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity

Abstract

We consider the problem of identifying, from statistics, a distribution of discrete random variables X1,,Xn that is a mixture of k product distributions. The best previous sample complexity for nO(k) was (1/ζ)O(k2logk) (under a mild separation assumption parameterized by ζ). The best known lower bound was exp(Ω(k)). It is known that n2k1 is necessary and sufficient for identification. We show, for any n2k1, how to achieve sample complexity and run-time complexity (1/ζ)O(k). We also extend the known lower bound of eΩ(k) to match our upper bound across a broad range of ζ. Our results are obtained by combining (a) a classic method for robust tensor decomposition, (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.

Publication
In The 37th Annual Conference on Learning Theory
Bijan Mazaheri
Bijan Mazaheri

My interests include mixture models, high level data fusion, and stability to distribution shift - usually through the lense of causality.