Feature extraction using regular expression in detecting proper noun for Malay news articles based on KNN algorithm

Farid Morsidi

Type :	Article
Subject :	Q Science (General)
ISSN :	1112-9867
Main Author :	Farid Morsidi
Additional Authors :	Suliana Sulaiman; Rohaizah Abdul Wahid
Title :	Feature extraction using regular expression in detecting proper noun for Malay news articles based on KNN algorithm

Year of Publication :	2017

Abstract :

The identification of proper nouns in text aims to classify named entities according to their respective groupings, a task that forms part of Named Entity Recognition (NER). Proper noun disambiguation can adversely affect morphological analysis, which is vital for improving corpus availability through classification and new word assimilation. Occurrences of proper nouns can be annotated in text resources by mapping entities separately from their fragments. This research examined the impact of regular expressions (regex) on a text pattern identification sequence that queries and extracts proper nouns from a collection of unannotated Malay-language news articles. As a baseline study, it envisions several techniques for improving the precision and accuracy of extracted text entities, such as pre-processing and data clustering. The results showed that the F-scores of the output tested on the unannotated news dataset were between 30% and 60%.
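
As a rough illustration of the approach the abstract describes, the following Python sketch extracts proper noun candidates with a regular expression and scores them with an F-score. It is not the authors' pattern or evaluation code: the regex `PROPER_NOUN`, the helpers `extract_candidates` and `f_score`, the sample sentence, and the gold list are all assumptions made for this example. The KNN classification step named in the title is not shown.

```python
import re

# Hypothetical pattern (not taken from the paper): a run of capitalized words,
# optionally linked by the Malay name particles "bin" / "binti".
# Acronyms such as all-caps tokens are deliberately not covered here.
PROPER_NOUN = re.compile(
    r"\b[A-Z][a-z]+(?:\s+(?:(?:bin|binti)\s+)?[A-Z][a-z]+)*"
)


def extract_candidates(text):
    """Return proper noun candidates found in unannotated text."""
    candidates = []
    # Split into sentences so sentence-initial capitals can be filtered out.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for match in PROPER_NOUN.finditer(sentence):
            # A lone capitalized word at the start of a sentence is usually
            # capitalized by position only, so it is skipped.
            if match.start() == 0 and " " not in match.group():
                continue
            candidates.append(match.group())
    return candidates


def f_score(predicted, gold):
    """Harmonic mean of precision and recall over exact string matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up sample sentence and gold annotations, for illustration only.
sample = (
    "Perdana Menteri Najib Razak tiba di Kuala Lumpur semalam. "
    "Beliau bertemu Ahmad bin Ali di Putrajaya."
)
gold = ["Najib Razak", "Kuala Lumpur", "Ahmad bin Ali", "Putrajaya"]

found = extract_candidates(sample)
score = f_score(found, gold)
```

In this sketch the title "Perdana Menteri" is absorbed into the name that follows it, so that candidate does not match the gold entry exactly. Over-capture of this kind is one plausible reason a purely pattern-based extractor loses precision on unannotated news text.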


