The Use of Unsupervised Learning for Stylometric Feature Selection in Authorship Verification System
DOI:
https://doi.org/10.37934/araset.63.1.240254
Keywords:
Author verification system, Stylometric features, Feature selection, Clustering, Classification
Abstract
Recent developments in machine learning and text processing have made author verification systems reliable enough to resolve authorship cases. Although research on Author Verification (AV) has flourished, AV systems for Indonesian texts have not been fully explored. Because feature selection is one of the underlying problems of an AV system, this research focuses on finding the best combination of stylometric features for an AV system for Indonesian texts. To achieve this goal, three lexical features, two syntactic features, and one structural feature were combined into 20 feature combination sets. To discriminate among these feature combinations, a K-means clustering model was used and its outputs were measured with the Purity score. To validate their robustness, the feature combinations were tested in an AV system using MKNN, KNN, and SVM classifiers in five experimental scenarios. The most robust feature combination is KF3, which contains both syntactic features plus the structural one. This best feature combination was applied to our AV system, which was then tested with new datasets. In this test, the macro-average F-score is 0.79, while the macro-average precision and macro-average sensitivity scores are 0.83 and 0.76, respectively.
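As an illustration of the Purity score used to compare clusterings, the following is a minimal sketch under the standard definition: each cluster is credited with the count of its most frequent true author, and the credited counts are summed and divided by the total number of texts. The function name `purity_score` and the toy labels are hypothetical and are not taken from the paper; the paper's own implementation may differ.

```python
from collections import Counter, defaultdict


def purity_score(true_labels, cluster_ids):
    """Fraction of items that belong to the majority class of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    if len(true_labels) == 0:
        raise ValueError("at least one item is required")

    # Group the true author labels by the cluster each text was assigned to.
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)

    # For each cluster, count the texts written by its most frequent author.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)


# Hypothetical toy data: six texts by two authors, assigned to two clusters.
authors = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, 1]
score = purity_score(authors, clusters)
```

A score of 1.0 means every cluster contains texts from a single author, so a feature combination that yields higher purity separates authors more cleanly under this measure.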
