Clustering on Sentiment Analysis: Effect of Twitter Dataset

Sri Redjeki; Satria Abadi; Deborah Kurniati; Sri Rezeki Candra Nursari; Ariesta Damayanti; Edi Iskandar

doi:10.37934/araset.51.1.3951

Authors

Sri Redjeki, Indonesia Digital Technology University, Yogyakarta, Indonesia
Satria Abadi, Faculty of Computing and Meta Technology, Universiti Pendidikan Sultan Idris, Perak, Malaysia
Deborah Kurniati, Indonesia Digital Technology University, Yogyakarta, Indonesia
Sri Rezeki Candra Nursari, Universitas Pancasila, Jakarta, Indonesia
Ariesta Damayanti, Indonesia Digital Technology University, Yogyakarta, Indonesia
Edi Iskandar, Indonesia Digital Technology University, Yogyakarta, Indonesia

DOI:

https://doi.org/10.37934/araset.51.1.3951

Keywords:

Auto Labeling, Clustering, Deep Learning, LSTM, Sentiment Analysis

Abstract

Labeling text datasets is a challenge in sentiment analysis, especially when it is done manually, because manual labeling of Twitter data demands time, effort, and skill. This study auto-labels a Twitter dataset using a clustering approach and then classifies tourism Twitter sentiment with LSTM (Long Short-Term Memory), a deep learning algorithm. K-means is the clustering method used for auto-labeling, and LSTM is the deep learning model used for sentiment classification. The research dataset consists of 10,228 Indonesian-language tweets about tourism in Yogyakarta, Indonesia. LSTM classification is carried out twice: first with the manually labeled dataset, and then with the auto-labeled dataset. Sentiment is divided into three classes: negative, positive, and neutral. The results indicate that classifying tourism Twitter sentiment with the auto-labeled dataset provides better accuracy than classifying it with the manually labeled dataset. The LSTM model trained on the auto-labeled dataset yields optimal graphs with an average accuracy of 99%, while the model trained on the manually labeled dataset yields graphs that show overfitting, with an average accuracy of 40%. These results show that auto-labeling the dataset classes with K-means clustering can improve the classification accuracy of Yogyakarta tourism Twitter sentiment. The model produced in this study can help solve class-labeling problems in sentiment classification.
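
To make the auto-labeling step concrete, the following is a minimal sketch of K-means assigning one of three cluster ids to each tweet. It is not the authors' implementation. The abstract does not state how tweets were converted to vectors or how clusters were mapped to the negative, positive, and neutral classes, so the bag-of-words `vectorize` helper, the toy English `tweets`, and the `seed` parameter are all assumptions made for illustration.

```python
from collections import Counter
import random


def vectorize(texts):
    """Turn each text into a term-count vector over a shared vocabulary.

    Assumption: plain whitespace tokens and raw counts; the study's
    actual feature representation is not given in the abstract.
    """
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for t in texts:
        vec = [0.0] * len(vocab)
        for tok, n in Counter(t.lower().split()).items():
            vec[index[tok]] = float(n)
        vectors.append(vec)
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k=3, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster id per input vector."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct positions in the input.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each vector.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy tweets (the study used 10,228 Indonesian tweets).
tweets = [
    "the beach was beautiful and clean",
    "beautiful temple and friendly guides",
    "traffic was terrible and the hotel was dirty",
    "terrible service and dirty rooms",
    "the museum opens at nine",
    "the bus leaves at noon",
]

auto_labels = kmeans(vectorize(tweets), k=3)
for tweet, label in zip(tweets, auto_labels):
    print(label, tweet)
```

The cluster ids produced here are arbitrary integers. Turning them into sentiment classes requires a separate mapping step, which the abstract does not describe. In the pipeline summarized above, the resulting labels would then serve as training targets for the LSTM classifier.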