Effect of typos on text classification accuracy in word and character tokenization

Authors

  • Peter E. Shawky, College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Alexandria, Egypt
  • Saleh Mesbah ElKaffas, College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Alexandria, Egypt
  • Shawkat K Guirguis, Department of Information Technology, Institute of Graduate Studies and Research, Alexandria University, Alexandria, Egypt

DOI:

https://doi.org/10.37934/araset.40.2.152162

Keywords:

Deep learning, CNN, LSTM, Sentiment Analysis, Binary Classification, Character Tokenization

Abstract

Training a machine to “sense” a user's feelings through their writing (sentiment analysis) has become a crucial process in several domains, including marketing, research, and surveys, and especially in times of crisis such as COVID. Typos are one of the underestimated challenges in processing user-generated text (comments, tweets, etc.); they affect both the learning and the evaluation processes. The outcome of word tokenization changes drastically even with a single-character change, so, as expected, experiments have shown significant decreases in accuracy due to typos. Adding spelling correction as a preprocessing layer, and building one for every language, is very expensive in time and resources, which is a major obstacle for large datasets and real-time processing. As an alternative, a CNN model was fed the same text, tokenized once at the character level and once at the word level, while typos were induced. As the typo percentage approached 10% of the text, the results with character tokens surpassed those with word tokens. At 30% typos, the model consuming character tokens outperformed the same model consuming word tokens by a significant 22.3% in accuracy and 24.9% in F1-score. This approach to the inevitable typo challenge in NLP proved to be highly practical, saving substantial resources compared with running a spelling-correction model beforehand. It also removes a blocker for real-time processing of user-generated text while preserving acceptable accuracy.
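The sensitivity of word tokens to single-character changes can be illustrated with a minimal sketch. It is not the authors' pipeline: the substitution-only typo model, the helper names (`induce_typos`, `word_tokens`, `char_tokens`, `unchanged_fraction`), and the sample sentence are all assumptions made for illustration. The abstract does not say how typos were induced.

```python
import random
import string


def induce_typos(text, rate, seed=0):
    """Substitute roughly `rate` of the letters with a different random letter.

    Assumption: typos are modeled as substitutions only, so the text length
    and word boundaries are preserved and tokens stay aligned.
    """
    rng = random.Random(seed)
    chars = list(text)
    letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
    n_typos = round(rate * len(letter_positions))
    for i in rng.sample(letter_positions, n_typos):
        # Pick a replacement that differs from the current letter.
        choices = [c for c in string.ascii_lowercase if c != chars[i].lower()]
        chars[i] = rng.choice(choices)
    return "".join(chars)


def word_tokens(text):
    """Whitespace word tokenization (hypothetical stand-in)."""
    return text.lower().split()


def char_tokens(text):
    """Character tokenization: one token per character, spaces included."""
    return list(text.lower())


def unchanged_fraction(original, corrupted):
    """Fraction of aligned token positions that survive the corruption."""
    same = sum(a == b for a, b in zip(original, corrupted))
    return same / len(original)


clean = "the movie was surprisingly good and the acting was great"
noisy = induce_typos(clean, rate=0.10)

word_kept = unchanged_fraction(word_tokens(clean), word_tokens(noisy))
char_kept = unchanged_fraction(char_tokens(clean), char_tokens(noisy))
print(f"word tokens unchanged: {word_kept:.2f}")
print(f"char tokens unchanged: {char_kept:.2f}")
```

Under this substitution-only assumption, each typo alters exactly one character token but invalidates the entire word token that contains it, so a larger share of character tokens than word tokens survives the corruption in this example.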

Author Biographies

Peter E. Shawky, College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Alexandria, Egypt

eng.peter.ezzatt@gmail.com

Saleh Mesbah ElKaffas, College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, Alexandria, Egypt

saleh.mesbah@aast.edu

Shawkat K Guirguis, Department of Information Technology, Institute of Graduate Studies and Research, Alexandria University, Alexandria, Egypt

shawkat_g@yahoo.com

Published

2024-02-28

How to Cite

Peter E. Shawky, Saleh Mesbah ElKaffas, & Shawkat K Guirguis. (2024). Effect of typos on text classification accuracy in word and character tokenization. Journal of Advanced Research in Applied Sciences and Engineering Technology, 40(2), 152–162. https://doi.org/10.37934/araset.40.2.152162

Section

Articles