Date of Award
Summer 7-21-2020
Degree Type
Dissertation
Degree Name
Doctor of Philosophy in Analytics and Data Science
Department
Statistics and Analytical Sciences
Committee Chair/First Advisor
Herman Ray
Committee Member
Jennifer Priestley
Committee Member
Lin Li
Abstract
Through a review of epistemological frameworks in the social sciences, the history of frameworks in statistics, and the current state of research, we establish that data science appears to lack a consistent, quantitatively motivated model development framework, and that the downstream analysis effects of modeling choices are not uniformly documented. Examples are provided that illustrate how analytic choices, even when justifiable and statistically valid, affect downstream model results. This study proposes a unified model development framework that allows researchers to make statistically motivated modeling choices within the development pipeline, and a simulation study is used to provide empirical justification for the proposed framework. The study tests the utility of the framework by investigating the effects of normalization on downstream analysis results: normalization methods are examined through a decomposition of the empirical risk function, measuring their effects on model bias, variance, and irreducible error. Measurements of bias and variance are then applied as diagnostic procedures for model pre-processing and development within the unified framework. Findings from the simulation study are incorporated into the proposed framework and stress-tested on benchmark datasets as well as several applications.
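The risk decomposition the abstract invokes is presumably the standard bias-variance decomposition of expected squared-error risk; the abstract does not specify the loss or estimators used, so the following is a sketch under the usual assumptions (a data-generating model y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², a fitted model \hat{f} trained on a random sample, and expectation taken over training sets and noise):

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}
\]

Under this reading, a pre-processing choice such as normalization can shift the bias and variance terms, while the irreducible error σ² is fixed by the data-generating process; estimating the first two terms empirically is what would make them usable as the diagnostic procedures the abstract describes.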
Included in
Analysis Commons, Data Science Commons, Other Applied Mathematics Commons, Statistics and Probability Commons