Clustering on Sentiment Analysis: Effect of Twitter Dataset
DOI:
https://doi.org/10.37934/araset.51.1.3951
Keywords:
Auto Labeling, Clustering, Deep Learning, LSTM, Sentiment Analysis
Abstract
Labeling text datasets is a challenge in sentiment analysis, especially when it is done manually, because it demands time, effort, and skill; labeling Twitter data is particularly taxing. This study auto-labels a Twitter dataset using a clustering approach and then classifies tourism Twitter sentiment with LSTM (Long Short-Term Memory), a deep learning algorithm. K-means clustering is used for auto-labeling, and LSTM is used for sentiment classification. The research dataset consists of 10,228 Indonesian-language tweets about tourism in Yogyakarta, Indonesia. LSTM classification is carried out twice: first using the manually labeled dataset, and then using the auto-labeled dataset. Sentiment is divided into three classes: negative, positive, and neutral. The results indicate that classifying tourism Twitter sentiment with the auto-labeled dataset gives better accuracy than with the manually labeled dataset. The LSTM model using the auto-labeled dataset produces well-fitted learning curves with an average accuracy of 99%, whereas the model using the manually labeled dataset overfits, with an average accuracy of 40%. These results show that auto-labeling the dataset classes with K-means clustering can improve the accuracy of sentiment classification for Yogyakarta tourism tweets. The model produced in this study can help solve class-labeling problems in sentiment classification.
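The auto-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how tweets were vectorized, how K-means was initialized, or how clusters were mapped to sentiment classes. The sketch below assumes bag-of-words counts, a deterministic farthest-point initialization, a handful of invented English toy tweets, and a hypothetical hand-written mapping from cluster id to sentiment name. The LSTM classification stage is omitted.

```python
from collections import Counter


def vectorize(docs):
    """Turn whitespace-tokenized documents into bag-of-words count vectors."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(doc.lower().split()).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Plain K-means; returns one cluster id per point."""
    # Assumption: farthest-point initialization, chosen only to keep
    # this sketch deterministic.
    centroids = [points[0][:]]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(far[:])

    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy tweets (the study itself uses Indonesian-language tweets).
tweets = [
    "beautiful beach great view",
    "great beach beautiful sunset",
    "dirty street bad traffic",
    "bad traffic dirty market",
    "museum opens at nine",
    "museum ticket opens nine",
]

labels = kmeans(vectorize(tweets), k=3)

# Hypothetical mapping from cluster id to sentiment class, assigned here
# by inspecting the toy clusters; the abstract does not describe this step.
cluster_names = {0: "positive", 1: "negative", 2: "neutral"}
auto_labeled = [(tweet, cluster_names[lab]) for tweet, lab in zip(tweets, labels)]
```

In this sketch, `auto_labeled` pairs each tweet with a sentiment name and would stand in for the manually labeled dataset as training input to a classifier. K-means only returns anonymous cluster ids, so some rule for naming the clusters is always needed; the dictionary above is a placeholder for that rule.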