Prediction of Childhood Diarrhea in Bangladesh using Machine Learning Approach

Diarrhea remains a major health problem among under-five (U5) children and leads to high levels of morbidity and mortality. This study aims to determine the socio-demographic risk factors of diarrhea and to predict diarrhea status among U5 children in Bangladesh using a machine learning (ML) based approach. The Bangladesh Demographic and Health Survey (BDHS), 2014 dataset is used in this study. The dataset consists of 7,538 respondents, of whom 371 (4.9%) reported a child with diarrhea. Logistic regression (LR) is used to determine the high-risk factors of diarrhea. Four ML-based classifiers, namely naïve Bayes (NB), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and support vector machine (SVM), are then applied to predict the child's diarrhea status, and accuracy, sensitivity, and specificity are used to evaluate their performance. Around 4.9% of women reported that their children had experienced an episode of diarrhea in the two weeks before the survey. The LR model showed that the child's age, region (Khulna and Rangpur), mother's completion of secondary education, and a rich wealth index are significantly associated with diarrheal disease. Our findings indicate that SVM with a radial basis kernel yielded 65.61% accuracy, 66.27% sensitivity, and 52.28% specificity, which is comparatively better than the other classifiers. Diarrheal disease remains common among Bangladeshi children. Our study shows that SVM is capable of predicting child diarrhea status even though the data are highly imbalanced. These findings can help policy makers take appropriate decisions to reduce childhood diarrhea in Bangladesh.


Introduction
Diarrhea is a major health problem in both developed and developing countries, including Bangladesh. Globally, diarrhea accounts for about 1 in 9 child deaths [1] and is the second leading cause of mortality among under-five children [2]. Diarrhea is defined as the passage of three or more loose or watery stools in a 24-hour period [3]. Worldwide, about 1.7 million cases of childhood diarrhea have been reported, and 525,000 under-five children have died of diarrhea [2,4].
Many previous studies have identified the risk factors of diarrheal disease in Bangladesh [5][6][7]. To the best of our knowledge, however, no study in Bangladesh has both identified the risk factors of childhood diarrhea and predicted it using a machine learning (ML) based approach. In this study, an attempt has been made to identify the risk factors of childhood diarrhea. Naïve Bayes (NB) [8,12], linear discriminant analysis (LDA) [8,13], quadratic discriminant analysis (QDA) [8,14], and support vector machine (SVM) [8,15] are then used to predict childhood diarrhea. Stata version 14 and R (i386) version 3.4.2 were used for the analysis.

Result
Baseline characteristics of the respondents
Table 1 shows the baseline characteristics of the respondents. Overall, 4.9% of the children had diarrhea in the two weeks before the survey. The prevalence is highest (6.6%) in the 0-11 months age group and lowest (3.2%) in the 24-59 months age group. The prevalence is 5.3% among male children and 4.6% among female children. By region, the prevalence is highest in Chittagong division (6.4%) and lowest in Rangpur division (2.6%). It is slightly higher in rural areas (5.0%) than in urban areas (4.9%). Table 1 also shows that the prevalence is highest (6.1%) among children whose mothers have only primary education and lowest (2.4%) among children whose mothers have the highest level of education. Respondents with poor socio-economic status reported the highest prevalence of diarrhea (5.6%), while the middle and rich groups both reported a lower prevalence (4.5%). By religion, 4.9% of children in Muslim families, 4.7% in Hindu families, and 5.1% in families of other religions are affected by diarrhea. By source of drinking water, 4.7% of children in families that take water from a tube well and 9.2% of children in families that use pond water are affected.

The data come from the Bangladesh Demographic and Health Survey (BDHS), 2014. The BDHS, 2014 included a household survey of ever-married women aged 15-49 years [5], with 7,886 respondents. The dataset contains some missing values; after omitting them, 7,538 respondents are considered for the final analysis.

Potential risk factors
We have divided the child's age into three groups, namely 0-11 months, 12-23 months, and 24-59 months. The wealth index is originally divided into five categories (poorest, poorer, middle, richer, and richest), but for our analysis it is collapsed into three categories: poor, middle, and rich. The main predictor variables are age of the child, sex of the child, region, mother's education, wealth index, and source of drinking water.
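The recoding described above can be sketched in a few lines of Python (used here for illustration only; the study's analysis was run in Stata and R). The function names, the lowercase category labels, and in particular the rule that merges poorest/poorer into "poor" and richer/richest into "rich" are assumptions, since the text does not state how the five categories were collapsed.

```python
# Hypothetical recoding helpers. The labels and the collapsing rule are
# assumptions for illustration, not taken from the BDHS codebook.

WEALTH_MAP = {
    "poorest": "poor",
    "poorer": "poor",
    "middle": "middle",
    "richer": "rich",
    "richest": "rich",
}


def age_group(age_months: int) -> str:
    """Assign a child's age in completed months to one of the three groups."""
    if not 0 <= age_months <= 59:
        raise ValueError("age must be between 0 and 59 months for U5 children")
    if age_months <= 11:
        return "0-11"
    if age_months <= 23:
        return "12-23"
    return "24-59"


def collapse_wealth(quintile: str) -> str:
    """Collapse a five-level wealth label into poor / middle / rich."""
    return WEALTH_MAP[quintile.strip().lower()]


print(age_group(7), age_group(12), age_group(59))
print(collapse_wealth("Poorer"), collapse_wealth("richest"))
```

Keeping the mapping in a single dictionary makes the collapsing rule explicit and easy to change if a different grouping is preferred.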

Outcome variable
In this study, childhood diarrhea is the dependent variable, defined as whether the child had diarrhea in the two weeks before the survey (yes/no).

Statistical analysis
The Chi-square test and binary logistic regression (LR) analyses are used as statistical tools. Differences between children with and without diarrhea (yes/no) are analyzed using the Chi-square test for categorical variables. An LR-based model [8][9][10][11] is then used to identify the risk factors of diarrhea.
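To make the Chi-square step concrete, the following Python sketch computes the Pearson statistic for a 2x2 table of counts. It is a simplified illustration rather than the study's procedure: the function name is hypothetical, the counts are invented and are not BDHS data, and no continuity correction is applied.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts.

    `table` is [[a, b], [c, d]]: rows are groups, columns are outcome yes/no.
    All row and column totals are assumed to be non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts, invented purely for illustration (not BDHS data):
# rows = two groups, columns = diarrhea yes / no.
toy_table = [[30, 70], [10, 90]]
stat, p_value = chi_square_2x2(toy_table)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
```

Most predictors in this study have more than two categories. The same statistic applies to larger tables with more degrees of freedom, but the closed-form p-value used here holds only for the 2x2 case.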

Kernel optimization
Support vector machines with five different kernels, namely linear, polynomial, sigmoid, Laplace, and radial basis function (RBF), are adopted in this study. To evaluate the performance of these models, a 10-fold cross-validation protocol is performed, and the kernel that gives the highest classification accuracy is chosen. The best performance obtained after tuning each kernel is summarized in Table 3. It is observed that the RBF kernel gives the highest classification accuracy of 65.61%, along with 66.27% sensitivity and 52.28% specificity. Therefore, the RBF kernel is selected for the SVM for predicting childhood diarrhea.
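For reference, the five kernel functions can be written out explicitly. The Python sketch below is illustrative only. The parameter names (gamma, coef0, degree) and their default values are assumptions, and the Laplace kernel is given in one common form, exp(-gamma * ||x - y||). The parameterization and tuned values actually used in the study are not stated in the text.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def linear(x, y):
    return dot(x, y)


def polynomial(x, y, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree


def sigmoid(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


def laplace(x, y, gamma=1.0):
    # Decays with the Euclidean distance itself, not its square.
    return math.exp(-gamma * math.sqrt(sq_dist(x, y)))


def rbf(x, y, gamma=1.0):
    # Decays with the squared Euclidean distance.
    return math.exp(-gamma * sq_dist(x, y))


KERNELS = {
    "linear": linear,
    "polynomial": polynomial,
    "sigmoid": sigmoid,
    "laplace": laplace,
    "rbf": rbf,
}

u, v = [1.0, 2.0], [3.0, 4.0]
for name, kernel in KERNELS.items():
    print(name, kernel(u, v))
```

In a kernel-selection loop, each entry of `KERNELS` would be passed to the SVM trainer and scored by 10-fold cross-validation. That loop is omitted here because it depends on the SVM implementation used.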

Comparison of the classifiers
The comparison of the performance of the classifiers is presented in Table 4. In this study, we have used four classifiers, namely SVM, NB, LDA, and QDA, for predicting childhood diarrhea in Bangladesh. Accuracy, sensitivity, and specificity are used to evaluate their performance. It is observed that NB and LDA have greater accuracy and sensitivity than SVM.

The Chi-square test revealed that children's age, region, mother's education, wealth index, and source of drinking water are significantly associated (p-value < 0.05) with the prevalence of diarrhea among children.

Risk factors for diarrhea using logistic regression
The LR model is used to assess the net effects of socio-demographic variables on childhood diarrhea. Odds ratios (OR) with 95% confidence intervals (CI) are used to compare groups and are presented in Table 2. Out of eight independent variables, four, viz. age of the child, region, mother's education, and wealth index, are statistically significant at the 5% level of significance. Compared with children aged 0-11 months, children aged 12-23 months (OR, 0.585; 95% CI, 0.464-0.738) and children aged 24-59 months (OR, 0.458; 95% CI, 0.331-0.633) had lower odds of diarrhea. Compared with children in Barisal division, children in Khulna division (OR, 0.550; 95% CI, 0.310-0.979) and Rangpur division (OR, 0.412; 95% CI, 0.232-0.732) had lower odds of diarrhea. Children of mothers with complete secondary or higher education (OR, 0.505; 95% CI, 0.263-0.972) had lower odds of diarrhea than children of mothers with no education. Again, respondents in the rich wealth index category had a lower prevalence of child diarrhea than those in the poor category.
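The reported odds ratios can be read as percentage changes in the odds of diarrhea relative to the reference group. The Python sketch below does this arithmetic for the values quoted above. The function name is hypothetical, and the standard error is recovered under the assumption that the intervals are Wald-type (symmetric on the log-odds scale), which the text does not state.

```python
import math

# (comparison, OR, lower 95% CI, upper 95% CI) as quoted in the text.
REPORTED = [
    ("age 12-23 months vs 0-11 months", 0.585, 0.464, 0.738),
    ("age 24-59 months vs 0-11 months", 0.458, 0.331, 0.633),
    ("Khulna vs Barisal", 0.550, 0.310, 0.979),
    ("Rangpur vs Barisal", 0.412, 0.232, 0.732),
    ("secondary complete or higher vs no education", 0.505, 0.263, 0.972),
]


def summarize(odds_ratio, lower, upper, z=1.96):
    """Return (percent change in odds, implied SE of log OR, significant?).

    The SE formula assumes a Wald-type interval, exp(log(OR) +/- z * SE).
    """
    pct_change = (odds_ratio - 1.0) * 100.0
    se_log_or = (math.log(upper) - math.log(lower)) / (2.0 * z)
    # A 95% CI that excludes 1 corresponds to significance at the 5% level.
    significant = upper < 1.0 or lower > 1.0
    return pct_change, se_log_or, significant


for label, odds_ratio, lower, upper in REPORTED:
    pct, se, sig = summarize(odds_ratio, lower, upper)
    print(f"{label}: {pct:+.1f}% odds, SE(log OR) ~ {se:.3f}, significant={sig}")
```

For example, an OR of 0.585 corresponds to 41.5% lower odds of diarrhea for children aged 12-23 months than for children aged 0-11 months. All five intervals exclude 1, which is consistent with significance at the 5% level.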

Conclusions
This study investigates the risk factors of childhood diarrhea and also suggests a model to predict childhood diarrhea. It shows that child's age, mother's education, region, and wealth index have a significant impact on childhood diarrhea. In this study, LDA, QDA, NB, and SVM-based classifiers are used to predict childhood diarrhea status. SVM with a radial basis kernel gives better performance than the others.
Despite their higher accuracy and sensitivity, NB and LDA failed to detect a single observation from the small group; that is, they have zero specificity, so NB and LDA are ruled out for this dataset. On the other hand, the accuracy and sensitivity of QDA are higher, but its specificity is very low compared to SVM. Hence, according to our objectives, SVM is the best of the four classifiers for predicting childhood diarrhea.
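The effect of class imbalance on these measures can be checked with a small Python sketch. It uses the class sizes reported in this study (7,538 children, 371 with diarrhea) and a degenerate classifier that always predicts the majority class. The function name is hypothetical, and treating the majority class ("no diarrhea") as the positive class is an assumption, made because the text equates missing the small group with zero specificity.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity, and specificity for binary labels.

    Sensitivity is the recall of the `positive` class; specificity is the
    recall of the other class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Class sizes from the text: 7,538 children, 371 of whom had diarrhea.
y_true = ["no"] * (7538 - 371) + ["yes"] * 371
# A degenerate classifier that always predicts the majority class.
y_pred = ["no"] * len(y_true)

acc, sens, spec = classification_metrics(y_true, y_pred, positive="no")
print(f"accuracy={acc:.4f}, sensitivity={sens:.4f}, specificity={spec:.4f}")
```

Such a classifier reaches about 95% accuracy and perfect sensitivity while identifying none of the children with diarrhea. This is why accuracy alone is a poor criterion for these data and why a classifier with zero specificity is discarded.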

Validation of the results of SVM
To validate the performance of SVM, we have used a simulated dataset. We have generated 800 observations, of which 400 are in class 1 and the remaining 400 are in class 2. The dataset is simulated from normal distributions with different means and standard deviations. The performance of SVM is presented in Table 5. It is noted that the SVM classifier gives the highest classification accuracy compared to the others.
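A dataset of this shape could be generated as in the following Python sketch. Only the class sizes (400 and 400) come from the text. The function name, the means, the standard deviations, the number of features, and the random seed are all assumed values chosen for illustration, because the simulation settings are not reported.

```python
import random


def simulate_two_classes(n_per_class=400, mean1=0.0, sd1=1.0,
                         mean2=2.0, sd2=1.5, n_features=2, seed=0):
    """Draw two classes of points from normal distributions that differ in
    mean and standard deviation. All parameter values are illustrative."""
    rng = random.Random(seed)
    data = []
    for label, mean, sd in ((1, mean1, sd1), (2, mean2, sd2)):
        for _ in range(n_per_class):
            features = [rng.gauss(mean, sd) for _ in range(n_features)]
            data.append((features, label))
    rng.shuffle(data)  # mix the classes before any train/test split
    return data


data = simulate_two_classes()
counts = {label: sum(1 for _, y in data if y == label) for label in (1, 2)}
print(len(data), counts)
```

Unlike the survey data, this simulated set is balanced, so it tests the classifiers in a setting where accuracy is not dominated by a majority class.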

Discussion
This study also shows that child's age, region, mother's education, and wealth index are significantly associated with childhood diarrhea. Our findings show that the child's age has a significant impact on diarrhea: children aged 12-23 and 24-59 months have a lower prevalence of diarrhea. It is also observed that Khulna and Rangpur regions have a lower prevalence of diarrhea than Barisal region. Mother's completion of secondary education is an important factor for childhood diarrhea, which is similar to the findings of previous studies [16,17], because an educated mother is more conscious about her own life and her children's lives.
It is noted that wealth index has a significant impact on diarrhea. Children from rich families have a lower prevalence of diarrhea than those from poor families [17]. Previous studies did not attempt to predict childhood diarrheal disease. To the best of our knowledge, this is the first time that four ML-based approaches, namely NB, LDA, QDA, and SVM, have been applied to predict childhood diarrhea. Our study