A Novel Clustering and Matrix Based Computation for Big Data Dimensionality Reduction and Classification

Jijo Varghese; P. Tamil Selvan

doi:10.37934/araset.32.1.238251

Authors

Jijo Varghese, Department of CS, CA&IT, Karpagam Academy of Higher Education, Coimbatore, Tamil Nādu, India
P. Tamil Selvan, Department of CS, CA&IT, Karpagam Academy of Higher Education, Coimbatore, Tamil Nādu, India

DOI:

https://doi.org/10.37934/araset.32.1.238251

Keywords:

Big Data, Clustering, Word Pattern, Similarity Measures, Dimensionality Reduction

Abstract

For clustering and classification of high-dimensional or "Big Data (BD)" document collections, the dimensionality of the documents has to be considered. Reducing the volume of the documents can also lower the overhead of classification methods. However, the reduced document collection may itself introduce noise and anomalies. Several strategies for removing noisy and anomalous information have been established over time. To increase classification accuracy, existing or newly created classification methods must deal with some of the most difficult issues in BD document categorization and clustering. Hence, the goals of this research derive from issues that can be solved only by improving the classification accuracy of classifiers. Superior clusters may also be achieved by using effective "Dimensionality Reduction (DR)". As the first step in this research, we introduce a unique DR approach that preserves word frequency in the document collection, allowing the classification algorithm to obtain improved or at least equal classification accuracy with a lower-dimensional set of documents. For clustering "Word Patterns (WPs)" during "WP Clustering (WPC)", we propose a new WP "Similarity Function (SF)" for "Similarity Computation (SC)". DR of the document collection is accomplished using information gained from the various WP clusters. Finally, we provide "Similarity Measures" for SC of high-dimensional texts and deliver an SF for document classification. Using assessment criteria such as "Information-Ratio for Dimension-Reduction", "Accuracy", and "Recall", we found that the proposed method, WP paired with SC (WP-SC), scaled effectively to higher-dimensional "Datasets (DS)" and surpassed the current technique AFO-MKSVM.
According to the findings, the WP-SC approach produced more favorable outcomes than the LDA-SVM and AFO-MKSVM approaches.
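
The pipeline outlined in the abstract (build word patterns, cluster them by similarity, then reduce the document collection using the clusters) can be illustrated with a minimal sketch. The abstract does not give the proposed SF, so everything here is an assumption made for illustration: a word pattern is taken to be the vector of class-conditional proportions of a word, cosine similarity stands in for the SF, clustering is a greedy single pass with a hypothetical `threshold`, and the function names `word_patterns`, `cluster_words`, and `reduce_docs` are invented. Summing word counts inside each cluster keeps each document's total word frequency unchanged.

```python
from math import sqrt

def word_patterns(doc_term, labels, n_classes):
    """Assumed word pattern of word j: share of its occurrences in each class."""
    patterns = []
    for j in range(len(doc_term[0])):
        per_class = [0.0] * n_classes
        for row, c in zip(doc_term, labels):
            per_class[c] += row[j]
        total = sum(per_class)
        patterns.append([v / total if total else 0.0 for v in per_class])
    return patterns

def cosine(a, b):
    """Cosine similarity, used here as a stand-in for the paper's SF."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_words(patterns, threshold=0.9):
    """Greedy single pass: a word joins the first cluster whose centroid is similar enough."""
    clusters, centroids = [], []
    for j, p in enumerate(patterns):
        for k, c in enumerate(centroids):
            if cosine(p, c) >= threshold:
                clusters[k].append(j)
                n = len(clusters[k])
                # Running mean of the member patterns.
                centroids[k] = [(cv * (n - 1) + pv) / n for cv, pv in zip(c, p)]
                break
        else:
            clusters.append([j])
            centroids.append(list(p))
    return clusters

def reduce_docs(doc_term, clusters):
    """One reduced feature per cluster: the sum of its members' counts."""
    return [[sum(row[j] for j in cl) for cl in clusters] for row in doc_term]

# Toy data (made up): 4 documents, 4 words, 2 classes.
doc_term = [
    [2, 1, 0, 0],
    [1, 3, 0, 1],
    [0, 0, 4, 2],
    [0, 1, 1, 3],
]
labels = [0, 0, 1, 1]

patterns = word_patterns(doc_term, labels, n_classes=2)
clusters = cluster_words(patterns, threshold=0.9)
reduced = reduce_docs(doc_term, clusters)
print(clusters)  # [[0, 1], [2, 3]]
print(reduced)   # [[3, 0], [4, 1], [0, 6], [1, 4]]
```

In this toy case four word dimensions collapse to two, and each document's row sum is the same before and after reduction. Any classifier could then be trained on `reduced` instead of `doc_term`.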