UPSI Digital Repository (UDRep)
Abstract : Universiti Pendidikan Sultan Idris |
Missing data is a common problem in real-world datasets and is among the most complex topics in computer science and many other research domains. Missing values are usually handled either by elimination or by imputation, depending on the volume of missing data and how it is distributed, so new imputation approaches and efficient algorithms remain necessary. Most existing imputation methods focus on a moderate amount of missing data; imputation at missing rates above 80% is still important but challenging. The few works that do address high missing volumes mostly rely on a reference dataset (a complete dataset used for evaluation): they create artificial missing values in it and then impute them to measure the accuracy of the proposed technique. The case of imputing a high proportion of missing values with no reference comparison dataset (an original dataset that is itself highly incomplete) has often been overlooked. We therefore propose a missing data imputation approach for high volumes of missing values with no reference comparison dataset. The approach applies pre-processing measures, breaks the dataset into small continuous non-missing portions, and then uses multi-criteria decision-making (MCDM) analysis to select a portion that is representative of the whole set of portions. This portion serves as the reference for comparison: artificial missing values are introduced into it at different percentages and imputed with different machine learning techniques, and the outcome is used to extend imputation to the full incomplete dataset. The study conducted two experiments using BMI datasets with more than 80% missing values, obtained from the National Child Development Research Centre (NCDRC) at Sultan Idris Education University (UPSI), Malaysia. The results show that the approach is capable of reconstructing datasets with very high proportions of missing values. © 2021 Elsevier Ltd |
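The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy BMI values, the weighted score standing in for the MCDM analysis, and mean imputation standing in for the machine learning imputers are all assumptions made for illustration.

```python
import math
import random
import statistics


def complete_segments(series):
    """Split a series into contiguous runs of non-missing values (missing = None)."""
    segments, run = [], []
    for value in series:
        if value is None:
            if run:
                segments.append(run)
            run = []
        else:
            run.append(value)
    if run:
        segments.append(run)
    return segments


def pick_representative(segments):
    """Hypothetical stand-in for the MCDM step: favour long segments whose
    mean lies close to the mean of all observed values."""
    observed = [v for seg in segments for v in seg]
    overall_mean = statistics.fmean(observed)
    longest = max(len(seg) for seg in segments)

    def score(seg):
        length_score = len(seg) / longest
        mean_gap = abs(statistics.fmean(seg) - overall_mean)
        return 0.5 * length_score + 0.5 / (1.0 + mean_gap)

    return max(segments, key=score)


def mask_at_random(segment, rate, seed=0):
    """Hide roughly `rate` of the values, always leaving at least one observed."""
    rng = random.Random(seed)
    n_missing = min(len(segment) - 1, round(rate * len(segment)))
    hidden = set(rng.sample(range(len(segment)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(segment)]


def impute_mean(masked):
    """Placeholder imputer: fill each gap with the mean of the observed values."""
    fill = statistics.fmean(v for v in masked if v is not None)
    return [fill if v is None else v for v in masked]


def rmse_on_hidden(truth, estimate, masked):
    """Root-mean-square error computed only at the artificially hidden positions."""
    errors = [(t - e) ** 2 for t, e, m in zip(truth, estimate, masked) if m is None]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


# Toy BMI series (hypothetical values, far less sparse than the study's data).
bmi = [None, 15.2, 15.8, None, None, 16.1, 16.4, 15.9, 16.0, 16.3, 15.7,
       None, None, 14.9, None]

reference = pick_representative(complete_segments(bmi))
for rate in (0.2, 0.5):
    masked = mask_at_random(reference, rate, seed=42)
    imputed = impute_mean(masked)
    print(f"rate={rate:.0%} rmse={rmse_on_hidden(reference, imputed, masked):.3f}")
```

Because the selected portion is complete, the error of any imputer can be measured on it at each artificial missing rate, which is what allows candidate techniques to be compared when no complete reference dataset exists.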
This material may be protected under Copyright Act which governs the making of photocopies or reproductions of copyrighted materials. You may use the digitized material for private study, scholarship, or research. |