Date of Award

Fall 12-16-2021

Degree Type


Degree Name

Doctor of Philosophy in Analytic and Data Science


Statistics and Analytical Sciences

Committee Chair/First Advisor

Dr. Joe DeMaio

Committee Member

Dr. Herman "Gene" Ray

Committee Member

Dr. Yifan Zhang

Committee Member

Dr. Xinyan Zhang


Understanding how compensation structures influence overall healthcare costs is a central issue in health economics. Episodes of Care (EoC) is a compensation structure that bundles payments for healthcare interventions that belong to a well-defined health event. Since the variation of clinical pathways can drive the cost of healthcare, this research uses sequences of medical billing codes in Perinatal Episodes of Care claims data to study the extent of that variation by equating it to the number of reproducible clusters found. This research proposes a methodological framework to detect reproducible clusters in an unsupervised problem where the true number of clusters is unknown. The proposed framework utilizes k-medoids clustering because it accommodates the string-based distance measures commonly used with categorical sequences while meeting the expectations of physicians that the method returns an existing pathway as the central point and produces clusters of pathways that are edge-connected. Additionally, the proposed framework tests set-based and sequence-based distance measures to determine which may better capture the nature of the variation in the treatment of Perinatal EoC. The framework tests different initialization strategies as well as cluster validation indices, though it focuses on the use of Prediction Strength (PS) to address the concept of reproducibility in clustering. Since the recovery of the unknown true number of clusters (true k) is important, a simulation study is performed in Chapter 3 with datasets where true k is known. It was found that the Sum of the Distance k-means++ (SDK) initialization strategy improved the recovery of true k in the simulation studies when compared with k-means++, though both initialization strategies produced nearly identical results with the episodic data. Clustering with the Jaccard Distance produced two reproducible clusters, while clustering with the Edit Distance did not.
When visually inspecting projections of the data to a three-dimensional Euclidean coordinate space, it was found that the clusters produced from the Edit Distance have many outliers, potentially contributing to the lack of reproducibility. Visually overlaying cost information onto the projections of the data revealed lower-cost subgroups, though further improvements to the initialization strategies appear necessary to detect these subgroups repeatedly. This framework demonstrates the ability to distinguish clinically meaningful variation from non-efficacious variation that may impact cost. In doing so, it has the potential to highlight differences in pathways which may be due to physician and/or patient preference.
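
The contrast between a set-based and a sequence-based distance, and the property that k-medoids returns an existing pathway as each cluster center, can be illustrated with a minimal sketch. It is not the dissertation's implementation: the pathway tokens are made-up placeholders rather than real billing codes, the function names are hypothetical, the initial medoids are fixed by hand instead of using k-means++ or SDK, and Prediction Strength is not computed.

```python
def jaccard_distance(a, b):
    """Set-based distance: 1 - |A & B| / |A | B| (order and repeats are ignored)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def edit_distance(a, b):
    """Sequence-based Levenshtein distance over whole codes, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute x -> y
        prev = curr
    return prev[-1]


def distance_matrix(seqs, dist):
    """Full pairwise distance matrix as a list of lists."""
    return [[dist(s, t) for t in seqs] for s in seqs]


def assign(D, medoids):
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: D[i][medoids[c]])
            for i in range(len(D))]


def k_medoids(D, initial_medoids, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix D."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(D, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:          # keep the old medoid if a cluster empties
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total distance
            # to the other members, so it is always an observed pathway.
            new_medoids.append(
                min(members, key=lambda m: sum(D[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(D, medoids)


# Toy pathways built from placeholder tokens (assumption: not real billing codes).
pathways = [
    ["visit", "lab", "ultrasound", "delivery"],
    ["visit", "ultrasound", "lab", "delivery"],
    ["visit", "lab", "delivery"],
    ["visit", "anesthesia", "csection", "recovery"],
    ["visit", "csection", "anesthesia", "recovery"],
    ["visit", "anesthesia", "csection", "recovery", "lab"],
]

D_jaccard = distance_matrix(pathways, jaccard_distance)
D_edit = distance_matrix(pathways, edit_distance)

# Initial medoids chosen by hand for illustration only.
medoids, labels = k_medoids(D_jaccard, initial_medoids=[0, 3])
```

In this toy data the first two pathways contain the same codes in a different order, so the Jaccard distance treats them as identical while the edit distance does not. Passing `D_edit` instead of `D_jaccard` to `k_medoids` runs the same procedure with the sequence-based measure.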

Included in

Data Science Commons