The Effect of Balanced and Imbalanced Dataset for Comparing Machine Learning Models in Cancer Survival Prediction with Poverty Status Data

Michelle Tan; Stephanie Chua; Puteri Nor Ellyza Nohuddin

doi:10.37934/araset.59.1.6577

Authors

Michelle Tan, Faculty of Computer Science and Information Technology, University Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Stephanie Chua, Faculty of Computer Science and Information Technology, University Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Puteri Nor Ellyza Nohuddin, Higher Colleges of Technology, Sharjah Women’s College, 79799 Abu Dhabi, United Arab Emirates

Keywords:

Imbalanced dataset, Machine learning, Cancer survival, Prediction

Abstract

This study compares the performance of machine learning algorithms on balanced and imbalanced datasets for cancer survival prediction using poverty status data. It examines the relationship between cancer survival and poverty, motivated by cancer's substantial contribution to mortality and the role of socioeconomic status in widening disparities. Although the link between cancer mortality and socioeconomic status has been studied extensively, little attention has been paid to cancer survival in relation to poverty. Moreover, existing comparative studies typically focus on a single cancer type, so a picture spanning multiple cancer types is lacking. This study addresses that gap by using machine learning algorithms to predict cancer survival from a dataset extracted from SEER*Stat. Five algorithms, namely Support Vector Machine, Random Forest, Logistic Regression, Decision Tree, and Naïve Bayes, were compared on balanced and imbalanced data covering individuals above and below the poverty line. Class-balancing techniques were applied to mitigate the bias that arises from imbalanced data, particularly with respect to poverty status. The results showed that Support Vector Machine, Random Forest, Logistic Regression, and Naïve Bayes performed consistently well on both balanced and imbalanced datasets, whereas Decision Tree performed less satisfactorily.
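
The abstract does not name the class-balancing technique that was used, so the following is only a minimal sketch of one common option, random oversampling, together with a metric (balanced accuracy) that is less sensitive to class imbalance than plain accuracy. The function names, the toy data, and the 90/10 class split are all hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (assumed technique)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; the majority class cannot inflate it."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 90 records of class 1 and 10 records of class 0.
rows = [[i] for i in range(100)]
labels = [1] * 90 + [0] * 10

bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)

# A predictor that always outputs the majority class scores well on
# plain accuracy but only at chance level on balanced accuracy.
majority_pred = [1] * len(labels)
plain_acc = sum(p == y for p, y in zip(majority_pred, labels)) / len(labels)
bal_acc = balanced_accuracy(labels, majority_pred)
```

In a real comparison, oversampling would normally be applied to the training split only, so that duplicated records do not leak into the evaluation data.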