Stress Detection Through Text in Social Media Using Machine Learning Techniques

Fauziah Kasmin; Nur Afeeqah Irsaleena Razali; Sharifah Sakinah Syed Ahmad; Zuraini Othman; Dian Sa’adilah Maylawati

doi:10.37934/araset.61.4.161175

Authors

Fauziah Kasmin, Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM), Melaka, Malaysia
Nur Afeeqah Irsaleena Razali, Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM), 76100 Durian Tunggal, Melaka, Malaysia
Sharifah Sakinah Syed Ahmad, Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM), 76100 Durian Tunggal, Melaka, Malaysia
Zuraini Othman, Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM), 76100 Durian Tunggal, Melaka, Malaysia
Dian Sa’adilah Maylawati, Department of Informatics, UIN Sunan Gunung Djati Bandung, Indonesia

DOI:

https://doi.org/10.37934/araset.61.4.161175

Keywords:

Stress detection, machine learning, social media, text

Abstract

In today's digital era, stress-related discussion on social media platforms such as Twitter, Facebook, Instagram, and Reddit has attracted considerable attention. Stress causes mental and financial problems, impairs the ability to think clearly at work, strains relationships with co-workers, can lead to depression, and, in extreme circumstances, can result in suicide. Identifying stress is therefore crucial to reducing its effects. Detecting and measuring stress in the vast volume of social media data, however, is a difficult and time-consuming task. This study therefore explores the detection and quantification of stress through user behaviour analysis using machine learning approaches. Our primary goals are to develop a binary stress detection model, to conduct a thorough comparative analysis of machine learning models, and to design an intuitive stress detection dashboard for visualizing data. The study uses three datasets: a Reddit dataset containing 3,532 records, a Twitter dataset with 1,228 records, and an integrated dataset combining both sources, totalling 4,760 records. Feature extraction techniques, particularly Term Frequency-Inverse Document Frequency (TF-IDF), are used to derive features from the textual data. The findings show how the performance of the machine learning models varies across datasets and training/testing splits. The Logistic Regression model reaches 73% accuracy on the Reddit dataset. All models perform well on the Twitter dataset; under certain conditions, however, the Support Vector Machine model outperforms the others with 81% accuracy. On the combined dataset, the Support Vector Machine is again the best performer, with an accuracy of 74%.
The findings contribute significantly to ongoing efforts in enhancing stress detection, early intervention strategies, and health research within the sphere of social media.
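
To illustrate the TF-IDF feature extraction mentioned in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the whitespace tokenizer, the IDF variant (ln(N/df) + 1), the function names, and the three example posts are all assumptions made for illustration, and the abstract does not state which TF-IDF variant or preprocessing the study uses.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumed; real pipelines clean more)."""
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = ln(N / df) + 1   (one common variant, chosen here as an assumption)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# Hypothetical posts, invented for illustration (not taken from the datasets).
posts = [
    "i feel so stressed about work",
    "work was fine today",
    "so stressed and anxious i cannot sleep",
]
vectors = tfidf(posts)
```

In this sketch, a term that occurs in only one post (such as "feel") receives a higher weight than a term of equal frequency that occurs in several posts (such as "work"). The resulting sparse vectors are the kind of input that classifiers such as Logistic Regression or a Support Vector Machine would then be trained on.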