A Study on the Best Classification Method for an Intelligent Phishing Website Detection System

Nor Hapiza Mohd Ariffin; Muhammad Imtiaz Mohamed Iqbal; Marina Yusoff; Nurul Akhmal Mohd Zulkefli

doi:10.37934/araset.48.2.197210

Authors

Nor Hapiza Mohd Ariffin Faculty of Business, Sohar University, Sohar, Oman
Muhammad Imtiaz Mohamed Iqbal Software Engineering Department, Motorola Solutions, Malaysia
Marina Yusoff Institute for Big Data Analytics and Artificial Intelligence (IBDAAI), Universiti Teknologi MARA (UiTM), Selangor Darul Ehsan, Malaysia
Nurul Akhmal Mohd Zulkefli Dhofar University, Oman

Keywords:

Phishing, classification algorithms, intelligent systems, machine learning

Abstract

It is impossible to imagine our lives without the internet, but it has also allowed malicious acts such as phishing to be carried out anonymously. Phishers use social engineering or fake websites to trick their victims into giving up personal information such as credit card numbers, bank passwords and other sensitive data. The number of phishing attacks has increased significantly in the last year, and current methods of detecting phishing are ineffective. This study focuses on identifying features of phishing websites, evaluating the best dataset and method for applying machine learning classification algorithms, and developing a prototype phishing detection system using the best classification model. Three machine learning classification algorithms were investigated: decision tree, logistic regression and k-nearest neighbours. The waterfall methodology of the system development life cycle (SDLC) was used. The approaches, strategies, tools and relevant theories were explored to provide an overview and understanding for this study, and an extensive literature review was conducted to develop the model and problem statement. Data was collected from an open-source licensed website and pre-processed before training and building the models to ensure that no noisy data was present. The parameters of the three models were tuned to obtain the best possible result from each for phishing detection. The models were then evaluated using the confusion matrix, accuracy, precision, recall and F1 score to determine the best classification model for distinguishing phishing from legitimate websites. After evaluation, the decision tree was found to be the most accurate in classifying phishing websites, with an accuracy of 95%.
In the future, the system can be improved through different approaches such as deep learning and through a fully developed web-based system that can be used in the real world.
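
The evaluation measures named in the abstract can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the label encoding (1 = phishing, 0 = legitimate) and the sample labels are assumptions made purely for illustration and are not data from the study.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the phishing class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels for illustration only (1 = phishing, 0 = legitimate).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Applying the same function to the predictions of each candidate model on a shared held-out test set would give directly comparable scores, which is one plausible way to rank the three classifiers.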