Classification of Breast Cancer Subtypes using Microarray RNA Expression Data

Authors

  • Muhammad Shazwan Suhiman, Mathematical Sciences Studies, College of Computing, Informatics, and Mathematics, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia
  • Sayang Mohd Deni, Mathematical Sciences Studies, College of Computing, Informatics, and Mathematics, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia
  • Ahmad Zia Ul-Saufie Mohamad Japeri, Mathematical Sciences Studies, College of Computing, Informatics, and Media, Universiti Teknologi MARA (UiTM), Pahang Campus, 27600, Raub, Pahang, Malaysia
  • Aszila Asmat, Mathematical Sciences Studies, College of Computing, Informatics, and Media, Universiti Teknologi MARA (UiTM), Pahang Campus, 27600, Raub, Pahang, Malaysia
  • Lirong Wang, School of Mathematics and Finance, Science and Technology, Hunan University of Humanities, Science and Technology, Loudi, 417000, P.R. China

DOI:

https://doi.org/10.37934/araset.46.1.7585

Keywords:

Breast cancer classification, Feature selection, Machine learning

Abstract

Breast cancer is a heterogeneous disease involving molecular alterations, cellular alterations and differing clinical outcomes, which makes its classification a diagnostic challenge. Current practice uses immunohistochemistry markers and clinical variables to classify breast cancer, but this approach has limitations due to the inclusion of other tumour subtypes and healthy individuals. Machine learning approaches based on mRNA expression data offer researchers new possibilities for investigating molecular biomarkers as diagnostic characteristics. The purpose of this study is to evaluate the ranking of features (genes) produced by feature selection methods for a breast cancer diagnostic test. Three feature selection methods, IG, ReliefF and mRMR, were applied, and subsets of the top 100, 50, 25, 10, 5 and 3 ranked genes were created. Each subset was tested with SVM, LR and RF classifiers, and performance was assessed using a confusion matrix. The results show that each of the IG, ReliefF and mRMR feature selection methods was able to achieve the highest accuracy with the SVM, LR and RF classifiers. mRMR with the RF classifier achieved the highest accuracy with the fewest top-ranked genes (25 genes). A hybrid feature selection approach (mRMR + SVM) improved the accuracy obtained with the top 3 ranked genes for the SVM, LR and RF classifiers. Future work should explore other feature selection methods and classifiers to assess the classification accuracy achievable with the smallest feature subsets in multiclass cancer datasets.
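
The workflow summarised in the abstract (rank genes, keep the top-k, score predictions with a confusion matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy expression values, the gene names, the median-split discretisation used to compute information gain, and the placeholder predictions are all assumptions made for illustration, and no SVM, LR or RF classifier is actually trained here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG of one gene after splitting its expression at the median.
    The median split is an illustrative assumption, not the paper's method."""
    threshold = sorted(values)[len(values) // 2]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    low = [y for v, y in zip(values, labels) if v < threshold]
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in (high, low) if p)
    return entropy(labels) - remainder

def rank_genes(expression, labels):
    """Gene names sorted by decreasing information gain."""
    scores = {g: information_gain(v, labels) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)

def top_k_subsets(ranked, sizes=(100, 50, 25, 10, 5, 3)):
    """One subset of top-ranked genes per requested size."""
    return {k: ranked[:k] for k in sizes}

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

# Hypothetical toy data: six samples, two classes, three made-up genes.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 5.0, 5.5, 4.8],  # separates the classes
    "gene_b": [2.0, 6.0, 2.1, 6.1, 2.2, 6.2],  # weakly informative
    "gene_c": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],  # uninformative
}

ranked = rank_genes(expression, labels)
subsets = top_k_subsets(ranked, sizes=(2, 1))

# Placeholder predictions standing in for a trained classifier's output.
predicted = ["A", "A", "B", "B", "B", "B"]
matrix = confusion_matrix(labels, predicted, classes=["A", "B"])
print(ranked, subsets, matrix, accuracy(matrix))
```

In a real analysis the placeholder predictions would come from a classifier trained on each gene subset, and the same accuracy calculation would be repeated for every combination of feature selection method, subset size and classifier.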

Author Biographies

Muhammad Shazwan Suhiman, Mathematical Sciences Studies, College of Computing, Informatics, and Mathematics, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia

shazwansuhiman@gmail.com

Sayang Mohd Deni, Mathematical Sciences Studies, College of Computing, Informatics, and Mathematics, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia

sayan929@uitm.edu.my

Ahmad Zia Ul-Saufie Mohamad Japeri, Mathematical Sciences Studies, College of Computing, Informatics, and Media, Universiti Teknologi MARA (UiTM), Pahang Campus, 27600, Raub, Pahang, Malaysia

ahmadzia101@uitm.edu.my

Aszila Asmat, Mathematical Sciences Studies, College of Computing, Informatics, and Media, Universiti Teknologi MARA (UiTM), Pahang Campus, 27600, Raub, Pahang, Malaysia

aszila@uitm.edu.my

Published

2024-04-26

Section

Articles
