Obesity Predictor Identification: Comparison of Correlation Based Feature Selection Method and Wrapper Method on Nutrition Dataset
DOI:
https://doi.org/10.37934/araset.49.1.129138
Keywords:
Classification Algorithm, Correlation Based Feature Selection, Feature Selection, Obesity Prediction, Wrapper Method
Abstract
The prevalence of obesity among Malaysians is estimated by calculating the obesity prevalence percentage from BMI prevalence data in the National Health and Morbidity Survey (NHMS). However, the nutrition data from the NHMS has not been used to predict national obesity prevalence, as it was collected solely to document an analysis report on the food consumption patterns of the base population. To address this gap, this study employs 15 nutrition variables derived from grocery data to predict obesity. To identify the appropriate nutrition variables, 8238 rows of raw grocery data (grocery receipts) collected from 35 households were explored. The 15 nutrition variables were generated during the data conversion and data transformation steps of the data pre-processing phase. The purpose of this study is to find alternative data (grocery data) that can be used to predict obesity and to test the relevance of that data by evaluating the accuracy of the predictions obtained with data mining techniques. The study predicts the percentage of selected nutrition variables, including macronutrient variables, that could lead to obesity in individuals. To simplify the prediction model, the dataset variables were filtered using the automated feature selection methods in the WEKA machine learning tool, version 3.8. The objective of feature selection was to identify the nutrition variables that have the most significant impact on developing accurate prediction models, with model accuracy evaluated using the area under the curve (AUC) score.
The generated nutrition dataset was subjected to the subset evaluation method known as correlation-based feature selection (CFS) and to wrapper methods, which include a learning algorithm in the attribute selection process. Several subsets were extracted during the feature selection phase and served as potential input datasets (predictors) for developing obesity prediction models using different classification algorithms. Based on the feature selection evaluation conducted in this study, the CFS method outperformed the three wrapper methods tested and selected the calorie_intake and foodpyramid_level3% variables as the appropriate predictors. These results can enhance the reliability of using household grocery data to predict obesity and open new avenues for research into nutrition and health prediction.
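To make the idea behind CFS concrete, the following is a minimal sketch and not the WEKA implementation used in the study. It scores a feature subset with the standard CFS merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-to-class correlation and r_ff the mean feature-to-feature correlation. Several parts are assumptions made for illustration: Pearson correlation stands in for whichever correlation measure WEKA applies, a simple greedy forward search replaces WEKA's search strategy, and the toy columns `feat_x`, `feat_y` and `dup_of_x` are hypothetical and unrelated to the study's grocery data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(vx * vy)


def cfs_merit(columns, target, subset):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward_select(columns, target, tol=1e-12):
    """Greedy forward search (an assumed, simplified search strategy).

    Repeatedly adds the feature that raises the merit the most and stops
    when no candidate improves it. Ties are broken by feature name.
    """
    selected, best = [], 0.0
    remaining = list(columns)
    while remaining:
        merit, feat = max(
            (cfs_merit(columns, target, selected + [f]), f) for f in remaining
        )
        if merit <= best + tol:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = merit
    return selected, best


# Hypothetical toy data: two complementary features and one exact duplicate.
columns = {
    "feat_x": [1, -1, 1, -1],
    "feat_y": [1, 1, -1, -1],
    "dup_of_x": [1, -1, 1, -1],  # redundant copy of feat_x
}
target = [2, 0, 0, -2]  # toy numeric target equal to feat_x + feat_y

selected, best = cfs_forward_select(columns, target)
print(selected, round(best, 4))
```

In this toy case the duplicate column is left out because it raises the feature-to-feature correlation without adding any correlation with the target. A wrapper method would instead score each candidate subset by training a classifier and measuring its performance, for example by AUC.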