Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Krautenbacher, Norbert; Theis, Fabian J.; Fuchs, Christiane

Krautenbacher N, Theis FJ, Fuchs C (2017)
Computational and Mathematical Methods in Medicine 2017: 7847531.

Journal article | Published | English

Download

No files have been uploaded. Publication record only.

DOI

https://doi.org/10.1155/2017/7847531

Authors

Krautenbacher, Norbert; Theis, Fabian J.; Fuchs, Christiane (UniBi)

Department

Fakultät für Wirtschaftswissenschaften > Lehrstuhl für Data Science

Abstract / Notes

Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
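
The general idea behind inverse-probability correction mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' stochastic inverse-probability oversampling or parametric inverse-probability bagging, and it does not use the sambia package. It only assumes that each sampled observation comes with a known selection probability, and all function names and numbers are hypothetical. It shows how weighting or resampling by the inverse of that probability restores the outcome prevalence of the unstratified population.

```python
import random


def inverse_probability_weights(selection_probs):
    """Return the weight 1/p for each selection probability p in (0, 1]."""
    if any(not 0.0 < p <= 1.0 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    return [1.0 / p for p in selection_probs]


def weighted_prevalence(labels, selection_probs):
    """Estimate the population share of label 1 with inverse-probability weights."""
    weights = inverse_probability_weights(selection_probs)
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)


def inverse_probability_resample(labels, selection_probs, n_draws, seed=0):
    """Draw with replacement, each observation weighted by 1/p.

    The resampled data approximate the composition of the unstratified
    population and could be passed to any classifier.
    """
    weights = inverse_probability_weights(selection_probs)
    rng = random.Random(seed)
    return rng.choices(labels, weights=weights, k=n_draws)


# Hypothetical toy design: a population of 1000 with 50 cases (5 %).
# All 50 cases are sampled (p = 1); 50 of the 950 controls are sampled.
labels = [1] * 50 + [0] * 50
selection_probs = [1.0] * 50 + [50 / 950] * 50

naive = sum(labels) / len(labels)                      # prevalence in the biased sample
corrected = weighted_prevalence(labels, selection_probs)
resampled = inverse_probability_resample(labels, selection_probs, n_draws=10_000, seed=1)
resampled_prevalence = sum(resampled) / len(resampled)

print(f"naive: {naive:.3f}, weighted: {corrected:.3f}, resampled: {resampled_prevalence:.3f}")
```

In this toy design the biased sample contains 50 % cases, whereas the weighted estimate recovers the 5 % population prevalence; the resampled data land close to it, up to sampling noise.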

Publication year

2017

Journal title

Computational and Mathematical Methods in Medicine

Volume

2017

Article no.

7847531

ISSN

1748-670X

eISSN

1748-6718

Page URI

https://pub.uni-bielefeld.de/record/2934011

Cite

Krautenbacher N, Theis FJ, Fuchs C. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine. 2017;2017: 7847531.

Krautenbacher, N., Theis, F. J., & Fuchs, C. (2017). Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine, 2017, 7847531. doi:10.1155/2017/7847531

Krautenbacher, Norbert, Theis, Fabian J., and Fuchs, Christiane. 2017. “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”. Computational and Mathematical Methods in Medicine 2017: 7847531.

Krautenbacher, N., Theis, F. J., and Fuchs, C. (2017). Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine 2017:7847531.

Krautenbacher, N., Theis, F.J., & Fuchs, C., 2017. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine, 2017: 7847531.

N. Krautenbacher, F.J. Theis, and C. Fuchs, “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”, Computational and Mathematical Methods in Medicine, vol. 2017, 2017, 7847531.

Krautenbacher, N., Theis, F.J., Fuchs, C.: Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine. 2017, 7847531 (2017).

Krautenbacher, Norbert, Theis, Fabian J., and Fuchs, Christiane. “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”. Computational and Mathematical Methods in Medicine 2017 (2017): 7847531.

Sources

PMID: 29312464
PubMed | Europe PMC
