UPSI Digital Repository (UDRep)

Type :article
Subject :QA75 Electronic computers. Computer science
ISSN :0960-0779
Main Author :Alamoodi, Abdullah Hussein
Additional Authors :Zaidan, Bilal Bahaa
Zaidan, A. A.
Albahri, O. S.
Chyad, M. A.
Garfan, Salem S.
Title :Machine learning-based imputation soft computing approach for large missing scale and non-reference data imputation
Place of Production :Tanjung Malim
Publisher :Fakulti Seni, Komputeran Dan Industri Kreatif
Year of Publication :2021
Notes :Chaos, Solitons and Fractals
Corporate Name :Universiti Pendidikan Sultan Idris

Abstract :
Missing data is a common problem in real-world data sets and is among the most complex topics in computer science and many other research domains. The common ways to cope with missing values are elimination and imputation, depending on the volume of the missing data and the nature of its distribution. It is therefore imperative to develop new imputation approaches along with efficient algorithms. Although most existing imputation methods focus on a moderate amount of missing data, imputation for high missing rates (over 80%) remains important but challenging. The few works that address high missing volumes mostly rely on a reference dataset (a complete dataset used for evaluation): they create artificial missing values in it, impute them, and measure the accuracy of their proposed techniques. So far, the option of imputing high proportions of missing values with no reference comparison dataset (an original dataset that itself has a high proportion of missing values) has often been ignored or overlooked. Therefore, we propose a missing data imputation approach for high volumes of missing values with no reference comparison dataset. The approach applies pre-processing measures, breaks the dataset into small continuous non-missing portions, and then uses multi-criteria decision-making (MCDM) analysis to select a portion that is representative of the entire broken dataset. This portion helps to create reference comparisons and to expand the missing dataset through artificial missing-making procedures at different percentages, followed by imputation using different machine learning techniques. This study conducted two experiments using BMI datasets with more than 80% missing values, derived from the National Child Development Centre (NCDRC) at Sultan Idris Education University (UPSI), Malaysia. The results show that our approach is capable of reconstructing datasets with very high proportions of missing values. © 2021 Elsevier Ltd
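The evaluation idea in the abstract (take a complete portion of the data, hide values artificially at different percentages, impute them, and compare against the hidden originals) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Several parts are assumptions made only for illustration: picking the longest complete run stands in for the Multi Criteria Decision-making selection step, mean imputation stands in for the machine learning imputers, RMSE is an assumed accuracy measure, and the sample values and all function names are hypothetical.

```python
import math
import random


def longest_complete_run(values):
    """Return the longest contiguous run of non-missing (non-None) values.

    Assumption: a simple stand-in for selecting a representative portion.
    """
    best, current = [], []
    for v in values:
        if v is None:
            current = []
        else:
            current.append(v)
            if len(current) > len(best):
                best = list(current)
    return best


def mask_at_random(reference, fraction, seed=0):
    """Hide a given fraction of a complete series; return (masked, hidden indices)."""
    rng = random.Random(seed)
    n_missing = int(round(fraction * len(reference)))
    hidden = set(rng.sample(range(len(reference)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(reference)]
    return masked, sorted(hidden)


def impute_mean(masked):
    """Fill missing entries with the mean of the observed ones (placeholder imputer)."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def rmse(reference, imputed, hidden):
    """Root-mean-square error measured only at the artificially hidden positions."""
    if not hidden:
        return 0.0
    total = sum((reference[i] - imputed[i]) ** 2 for i in hidden)
    return math.sqrt(total / len(hidden))


if __name__ == "__main__":
    # Hypothetical series with gaps; the numbers are made up for the sketch.
    series = [None, 15.2, None, None, 16.0, 16.1, 16.3, 16.2, 16.5, None, 17.0]
    reference = longest_complete_run(series)
    for fraction in (0.2, 0.4):
        masked, hidden = mask_at_random(reference, fraction, seed=42)
        imputed = impute_mean(masked)
        print(fraction, rmse(reference, imputed, hidden))
```

Because the hidden values are known, any imputer can be swapped in for `impute_mean` and scored in the same way at each missing percentage.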


This material may be protected under Copyright Act which governs the making of photocopies or reproductions of copyrighted materials.
You may use the digitized material for private study, scholarship, or research.
