Short Tandem Repeats Information in TCGA is Statistically Biased by Amplification

Siddharth Jain, Bijan Mazaheri, Netanel Raviv, Jehoshua Bruck

January 2019

Abstract

The current paradigm in data science rests on the belief that, given sufficient data, classifiers are likely to uncover the distinction between true and false hypotheses. In particular, the abundance of genomic data creates opportunities for discovering disease risk associations and for aiding screening and treatment. However, working with large amounts of data is statistically beneficial only if the data is statistically unbiased. Here we demonstrate that the methods used to amplify DNA samples in TCGA have a substantial effect on short tandem repeat (STR) information. In particular, we design a classifier that uses STR information to distinguish between samples with analyte code D and samples with analyte code W. This artificial bias might be detrimental to data-driven approaches and might undermine conclusions drawn from past and future genome-wide studies.

Type

Preprint

Distribution Shift Cancer

Bijan Mazaheri

Postdoctoral Associate