The Effect of Balanced and Imbalanced Dataset for Comparing Machine Learning Models in Cancer Survival Prediction with Poverty Status Data
DOI:
https://doi.org/10.37934/araset.59.2.232244
Keywords:
Imbalanced dataset, Machine learning, Cancer survival, Prediction
Abstract
This study compares the performance of machine learning algorithms on balanced and imbalanced datasets for cancer survival prediction with poverty status data. It examines the relationship between cancer survival and poverty, motivated by cancer's substantial impact on mortality rates and the role of socioeconomic status in exacerbating disparities. Although the link between cancer mortality and socioeconomic status has been examined extensively, little attention has been directed towards cancer survival in relation to poverty. Moreover, existing comparative studies typically focus on a single cancer type, leaving a gap in broader, cross-cancer insight. This study seeks to bridge this gap by employing machine learning algorithms to predict cancer survival, using a dataset extracted from SEER STAT. Five machine learning algorithms, namely Support Vector Machine, Random Forest, Logistic Regression, Decision Tree, and Naïve Bayes, were compared on balanced and imbalanced data, with data from individuals above and below the poverty line. The study also examined class-balancing techniques to mitigate biases arising from imbalanced data, particularly in the context of poverty. The results showed that Support Vector Machine, Random Forest, Logistic Regression, and Naïve Bayes performed stably and well on both balanced and imbalanced datasets, whereas the performance of the Decision Tree was less satisfactory.
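To make the idea of class balancing concrete, the following is a minimal sketch in Python that uses only the standard library. The abstract does not state which balancing technique or evaluation metric the study used, so random oversampling and balanced accuracy are assumptions chosen for illustration. The toy data, the function names `random_oversample` and `balanced_accuracy`, and the 90/10 class split are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (illustrative assumption,
    not necessarily the technique used in the study)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall. Unlike plain accuracy, it does not
    reward a model for always predicting the majority class."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 90 records labelled 1 and 10 labelled 0.
rows = [[i] for i in range(100)]
labels = [1] * 90 + [0] * 10

# Balance the classes by oversampling the minority class.
bal_rows, bal_labels = random_oversample(rows, labels)
class_counts = Counter(bal_labels)

# A trivial baseline that always predicts the majority class.
majority_pred = [1] * len(labels)
plain_acc = sum(p == y for p, y in zip(majority_pred, labels)) / len(labels)
bal_acc = balanced_accuracy(labels, majority_pred)
```

In this sketch the majority-class baseline looks strong under plain accuracy but not under balanced accuracy, which is why imbalanced data can bias a model comparison. In practice, oversampling would normally be applied to the training split only, so that duplicated records do not leak into the evaluation data.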