Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Krautenbacher N, Theis FJ, Fuchs C (2017)
Computational and Mathematical Methods in Medicine 2017: 7847531.

Download
No full text has been uploaded. Publication record only.
Journal article | Published | English
Author
Krautenbacher, Norbert; Theis, Fabian J.; Fuchs, Christiane
Abstract / Note
Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
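The basic idea behind inverse-probability resampling can be illustrated with a minimal sketch. This is not the authors' stochastic inverse-probability oversampling or parametric inverse-probability bagging, and it is not the API of the sambia package; it is a plain weighted bootstrap, written in Python for illustration. The function name `inverse_probability_resample`, the toy strata, and the selection probabilities are all hypothetical assumptions.

```python
import random
from collections import Counter


def inverse_probability_resample(rows, selection_prob, n_draws, seed=0):
    """Draw rows with replacement, weighting each row by 1 / P(selected).

    rows: list of dicts, each with a "stratum" key (hypothetical layout).
    selection_prob: dict mapping stratum -> probability of being sampled.
    """
    rng = random.Random(seed)
    # Rows from under-sampled strata get proportionally larger weights.
    weights = [1.0 / selection_prob[row["stratum"]] for row in rows]
    return rng.choices(rows, weights=weights, k=n_draws)


# Toy stratified sample in which cases are artificially enriched:
# 50 cases and 50 controls, although cases are rare in the population.
rows = [{"stratum": "case", "y": 1} for _ in range(50)]
rows += [{"stratum": "control", "y": 0} for _ in range(50)]

# Assumed (hypothetical) selection probabilities per stratum.
selection_prob = {"case": 0.5, "control": 0.05}

resampled = inverse_probability_resample(rows, selection_prob, n_draws=10000)
case_share = Counter(r["y"] for r in resampled)[1] / len(resampled)
print(f"case share in stratified sample: 0.50, after resampling: {case_share:.3f}")
```

Under these assumed probabilities, the resampled data should contain roughly 9% cases (100 / 1100 in expectation) instead of 50%, which is closer to the unstratified population a classifier would later be applied to. A bagging variant would repeat the draw several times and average the classifiers trained on each resample.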
Publication year
2017
Journal title
Computational and Mathematical Methods in Medicine
Volume
2017
Article no.
7847531
ISSN
eISSN
PUB-ID

Cite

Krautenbacher N, Theis FJ, Fuchs C. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine. 2017;2017: 7847531.
Krautenbacher, N., Theis, F. J., & Fuchs, C. (2017). Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine, 2017, 7847531. doi:10.1155/2017/7847531
Krautenbacher, N., Theis, F. J., and Fuchs, C. (2017). Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine 2017:7847531.
Krautenbacher, N., Theis, F.J., & Fuchs, C., 2017. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine, 2017: 7847531.
N. Krautenbacher, F.J. Theis, and C. Fuchs, “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”, Computational and Mathematical Methods in Medicine, vol. 2017, 2017: 7847531.
Krautenbacher, N., Theis, F.J., Fuchs, C.: Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine. 2017, 7847531 (2017).
Krautenbacher, Norbert, Theis, Fabian J., and Fuchs, Christiane. “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”. Computational and Mathematical Methods in Medicine 2017 (2017): 7847531.



Sources

PMID: 29312464
PubMed | Europe PMC
