The Use of Unsupervised Learning for Stylometric Feature Selection in Authorship Verification System

Authors

  • Lucia Dwi Krisnawati, Department of Informatics, Faculty of Information Technology, Universitas Kristen Duta Wacana, 55224, Yogyakarta, Indonesia
  • Thomas Widiarya Budiman, Department of Informatics, Faculty of Information Technology, Universitas Kristen Duta Wacana, 55224, Yogyakarta, Indonesia
  • Laurentius Kuncoro Purbo Saputra, Department of Informatics, Faculty of Information Technology, Universitas Kristen Duta Wacana, 55224, Yogyakarta, Indonesia
  • Su Cheng Haw, Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, Cyberjaya, Selangor, Malaysia

DOI:

https://doi.org/10.37934/araset.63.1.240254

Keywords:

Author verification system, Stylometric features, Feature selection, Clustering, Classification

Abstract

Recent developments in machine learning and text processing have made author verification systems reliable enough to resolve cases of disputed authorship. Although research in Author Verification (AV) has flourished, AV systems for Indonesian texts have not been fully explored. Because feature selection is one of the underlying problems of an AV system, this research focuses on finding the best combination of stylometric features for an AV system for Indonesian texts. To this end, three lexical features, two syntactic features, and one structural feature were combined into 20 feature combination sets. To discriminate among these combinations, a clustering model, K-means, was used and its outputs were measured with the Purity score. To validate their robustness, the feature combinations were tested in an AV system using MKNN, KNN, and SVM classifiers in five experimental scenarios. The most robust feature combination proved to be KF3, which contains both syntactic features plus the structural one. This best combination was applied to our AV system, which was then tested on new datasets. In this test, the macro-average F-score reached 0.79, while the macro-average precision and macro-average sensitivity were 0.83 and 0.76, respectively.
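
The abstract names Purity as the score used to judge the K-means output but does not define it. The sketch below uses the standard textbook definition (for each cluster, count the documents of its most frequent author, sum those counts, and divide by the total number of documents); whether the paper computes it in exactly this way is an assumption. The function name `purity_score` and the toy cluster and author labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter, defaultdict


def purity_score(cluster_ids, author_labels):
    """Standard clustering purity: fraction of documents that belong to
    the majority author of the cluster they were assigned to."""
    if len(cluster_ids) != len(author_labels):
        raise ValueError("cluster_ids and author_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how many documents of each author fall into each cluster.
    members = defaultdict(Counter)
    for cluster, author in zip(cluster_ids, author_labels):
        members[cluster][author] += 1

    # Sum the size of the majority author group in every cluster.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: 10 documents, 3 clusters, 3 authors.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
authors = ["A", "A", "B", "B", "B", "B", "C", "C", "C", "A"]

if __name__ == "__main__":
    print(purity_score(clusters, authors))
```

Under this definition a feature combination whose clusters align more closely with the true authors yields a score nearer to 1, which is how the score could be used to rank the 20 combinations.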

Author Biographies

Thomas Widiarya Budiman, Department of Informatics, Faculty of Information Technology, Universitas Kristen Duta Wacana, 55224, Yogyakarta, Indonesia

thomas.widiarya@ti.ukdw.ac.id

Laurentius Kuncoro Purbo Saputra, Department of Informatics, Faculty of Information Technology, Universitas Kristen Duta Wacana, 55224, Yogyakarta, Indonesia

kuncoro@staff.ukdw.ac.id

Published

2025-03-17

How to Cite

Dwi Krisnawati, L., Budiman, T. W., Saputra, L. K. P., & Haw, S. C. (2025). The Use of Unsupervised Learning for Stylometric Feature Selection in Authorship Verification System. Journal of Advanced Research in Applied Sciences and Engineering Technology, 63(1), 240–254. https://doi.org/10.37934/araset.63.1.240254

Section

Articles
