Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies
Krautenbacher N, Theis FJ, Fuchs C (2017)
Computational and Mathematical Methods in Medicine 2017: 7847531.
Journal article
| Published | English
Authors
Krautenbacher, Norbert;
Theis, Fabian J.;
Fuchs, Christiane
Abstract / Notes
Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
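The general idea behind inverse-probability resampling can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce the sambia package; it only shows generic resampling with weights equal to the inverse of each stratum's selection probability. The function name `inverse_probability_resample`, the toy sample, and the selection probabilities are all hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def inverse_probability_resample(rows, selection_prob, n_draws, seed=0):
    """Draw n_draws rows with replacement, weighting each row by
    1 / (selection probability of its stratum)."""
    rng = random.Random(seed)
    weights = [1.0 / selection_prob[row["stratum"]] for row in rows]
    return rng.choices(rows, weights=weights, k=n_draws)


# Hypothetical enriched sample: cases were retained with probability 0.9,
# controls with probability 0.1, so cases are overrepresented.
sample = [{"stratum": "case", "y": 1}] * 90 + [{"stratum": "control", "y": 0}] * 100
selection_prob = {"case": 0.9, "control": 0.1}

resampled = inverse_probability_resample(sample, selection_prob, n_draws=10_000)
case_share = Counter(r["y"] for r in resampled)[1] / len(resampled)

print(f"case share in biased sample: {90 / 190:.3f}")
print(f"case share after resampling: {case_share:.3f}")
```

Under these assumed probabilities, the 90 sampled cases stand for about 100 cases and the 100 sampled controls for about 1000 controls in the source population, so the resampled case share should lie near 100/1100. A classifier trained on the resampled rows would then see class proportions closer to those of the nonstratified data.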
Year of publication
2017
Journal title
Computational and Mathematical Methods in Medicine
Volume
2017
Article number
7847531
ISSN
1748-670X
eISSN
1748-6718
Page URI
https://pub.uni-bielefeld.de/record/2934011
Cite
Krautenbacher N, Theis FJ, Fuchs C. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine. 2017;2017: 7847531.
Krautenbacher, N., Theis, F. J., & Fuchs, C. (2017). Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine, 2017, 7847531. doi:10.1155/2017/7847531
Krautenbacher, Norbert, Theis, Fabian J., and Fuchs, Christiane. 2017. “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”. Computational and Mathematical Methods in Medicine 2017: 7847531.
Krautenbacher, N., Theis, F. J., and Fuchs, C. (2017). Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine 2017:7847531.
Krautenbacher, N., Theis, F.J., & Fuchs, C., 2017. Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine, 2017: 7847531.
N. Krautenbacher, F.J. Theis, and C. Fuchs, “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”, Computational and Mathematical Methods in Medicine, vol. 2017, 2017, Art. no. 7847531.
Krautenbacher, N., Theis, F.J., Fuchs, C.: Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine. 2017, 7847531 (2017).
Krautenbacher, Norbert, Theis, Fabian J., and Fuchs, Christiane. “Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies”. Computational and Mathematical Methods in Medicine 2017 (2017): 7847531.
Sources
PMID: 29312464