Proper noun detection using regex algorithm and rules for Malay named entity recognition

Farid Morsidi

Type :	Thesis
Subject :	QA Mathematics
Main Author :	Farid Morsidi
Title :	Proper noun detection using regex algorithm and rules for Malay named entity recognition

Place of Production :	Tanjong Malim
Publisher :	Fakulti Seni, Komputeran dan Industri Kreatif
Year of Publication :	2018
Corporate Name :	Universiti Pendidikan Sultan Idris

Abstract :

This study aimed to develop a Malay proper noun detection method to cluster and classify named entity categories, particularly the major classes of person, location, organization, and miscellaneous, in a Malay newspaper corpus. A regular expression (regex) pattern identification algorithm and a set of rules were introduced to overcome the limitations of dictionaries and gazetteers. Two visualization techniques, namely the decision tree and the term-document matrix, were used to evaluate the efficiency of the method. The decision tree achieved 74% accuracy. The term-document matrix visualization reached maximum values of 9.8007403, 9.8718517, and 9.9890683 for the Astro Awani, Berita Harian, and Bernama datasets, respectively. In conclusion, the regex algorithm could indicate the presence of Malay proper nouns, making it an appropriate extraction tool for clustering and classifying them. The study implies that the Malay proper noun detection method can increase the effectiveness of named entity recognition and help improve document retrieval for the Malay language.
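The abstract does not reproduce the thesis's actual expressions or rules, so the following is only a minimal Python sketch of the general idea: a regex finds runs of capitalised tokens as proper noun candidates, and a few rules assign each candidate to one of the four classes. The pattern, the cue-word lists, and the function names `classify` and `detect_proper_nouns` are all assumptions made for illustration and are not taken from the thesis.

```python
import re

# Illustrative pattern (an assumption, not the thesis's regex):
# one or more consecutive capitalised words, e.g. "Kuala Lumpur".
PROPER_NOUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

# Hypothetical cue words chosen for this sketch only.
PERSON_CUES = {"Encik", "Puan", "Datuk", "Dr"}
LOCATION_CUES = {"Kampung", "Jalan", "Pulau", "Bandar"}
ORGANIZATION_CUES = {"Universiti", "Kementerian", "Syarikat", "Bank"}


def classify(candidate):
    """Assign a class based on the first token of the candidate."""
    first = candidate.split()[0]
    if first in PERSON_CUES:
        return "person"
    if first in LOCATION_CUES:
        return "location"
    if first in ORGANIZATION_CUES:
        return "organization"
    return "miscellaneous"


def detect_proper_nouns(text):
    """Return (candidate, class) pairs found in the text."""
    results = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in PROPER_NOUN.finditer(sentence):
            candidate = match.group()
            # Rule: a lone capitalised word at the start of a sentence is
            # usually an ordinary word, so it is skipped.
            if match.start() == 0 and " " not in candidate:
                continue
            results.append((candidate, classify(candidate)))
    return results


sample = (
    "Encik Ahmad melawat Universiti Pendidikan Sultan Idris di Tanjong Malim. "
    "Dia tinggal di Kampung Baru."
)
for candidate, label in detect_proper_nouns(sample):
    print(candidate, "->", label)
```

In this sketch "Tanjong Malim" falls into the miscellaneous class because none of the cue words match it, which shows why a pattern-only approach still needs carefully designed rules to separate the classes.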

References

Abdallah, S., Shaalan, K., & Shoaib, M. (2012). Integrating rule-based system

with classification for arabic named entity recognition. In Lecture Notes in Computer

Science (including subseries Lecture Notes in Artificial Intelligence and Lecture

Notes in Bioinformatics) (Vol. 7181 LNCS, pp. 311–322).

http://doi.org/10.1007/978-3-642-28604- 9_26

AbdelRahman, S., Elarnaoty, M., & Magdy, M. (2010). Integrated Machine Learning

Techniques for Arabic Named Entity Recognition. International Journal of Computer Science, 7(4),

27–36. Retrieved from http://ijcsi.org/papers/IJCSI-Vol-7-Issue-4-No-3.pdf#page=41

Abdul-hamid, A., & Darwish, K. (2010). Simplified Feature Set for Arabic Named Entity

Recognition. Proceedings of the 2010 Named Entities Workshop, (July), 110–115. Retrieved from

http://www.aclweb.org/anthology/W10-2417

Abdullah, M., & Ahmad, F. (2009). Rules frequency order stemmer for malay language. … International

Journal of …, 9(2), 433–438. Retrieved from

http://paper.ijcsns.org/07_book/200902/20090258.pdf

Abedinpourshotorban, H., Hasan, S., Shamsuddin, S. M., & As’Sahra, N. F. (2016). A

differential-based harmony search algorithm for the optimization of continuous problems.

Expert Systems with Applications, 62, 317–332. http://doi.org/10.1016/j.eswa.2016.05.013

Aboaoga, M., & Aziz, M. J. A. (2013). Arabic person names recognition by using a

rule based approach. Journal of Computer Science, 9(7),

922–927. http://doi.org/10.3844/jcssp.2013.922.927

Abu Bakar, J., Omar, K., Nasrudin, M. F., & Murah, M. Z. (2013). Part-of-Speech for Old Malay

Manuscript Corpus: A Review. In Communications in Computer and Information Science (Vol.

378 CCIS, pp. 53–66). http://doi.org/10.1007/978-3-642-40567-9_5

Abu Bakar, J., Omar, K., Nasrudin, M. F., Murah, M. Z., Al-shoukry, S., Omar, N., … Klose,

A. (2013). Processing natural malay texts: A data-driven approach. Neurocomputing, 79(3),

2670–2676. http://doi.org/10.3176/tr.2010.1.06

Agarwal, S. K., Shah, S., & Kumar, R. (2015). Classification of mental tasks from EEG data using

backtracking search optimization based neural classifier. Neurocomputing, 166, 397– 403.

http://doi.org/10.1016/j.neucom.2015.03.041

Aggarwal, C., & Zhao, P. (2013). Towards graphical models for text processing. Knowledge and

Information Systems, 36(1), 1–21. http://doi.org/10.1007/s10115-012-0552-3

Ahmad, Z. H., & Khalifa, O. (2008). Towards designing a high intelligibility rule

based standard Malay text-to-speech synthesis system. Proceedings of the International Conference

on Computer and Communication Engineering 2008, ICCCE08: Global Links for Human

Development, 89–94. http://doi.org/10.1109/ICCCE.2008.4580574

Ahmed, Z. (2013). Named Entity Recognition and Question Answering Using Word Vectors and

Clustering.

Akbari, R., Hedayatzadeh, R., Ziarati, K., & Hassanizadeh, B. (2012). A multi-objective

artificial bee colony algorithm. Swarm and Evolutionary Computation, 2, 39–52.

http://doi.org/10.1016/j.swevo.2011.08.001

Alfred, R. (2016). Intelligent Information and Database Systems. In ACIIDS 2016, Part II (pp.

447–457). http://doi.org/10.1007/978-3-642-12145-6

Alfred, R., Leong, L. C., On, C. K., & Anthony, P. (2014). Malay Named Entity Recognition

Based on Rule-Based Approach. International Journal of Machine Learning and Computing,

4(3), 300–306. http://doi.org/10.7763/IJMLC.2014.V4.428

Aljoumaa, H. (2012). Development of a Self-Learning Approach Applied to Pattern

Recognition and Fuzzy Control, (September 2012), 127.

Al-Moslmi, T., Gaber, S., Al-Shabi, A., Albared, M., & Omar, N. (2015). Feature Selection

Methods Effects on Machine Learning Approaches in Malay Sentiment Analysis, (October),

2–5.

Alshalabi, H., Tiun, S., Omar, N., & Albared, M. (2013). Experiments on the Use of Feature

Selection and Machine Learning Methods in Automatic Malay Text Categorization.

International Conference on Electrical Engineering and Informatics (ICEEI 2013), 11(Iceei),

748–754. http://doi.org/10.1016/j.protcy.2013.12.254

Al-shammaa, M., & Abbod, M. F. (2015). Automatic Generation of Fuzzy Classification

Rules from Data.

Al-shoukry, S., & Omar, N. (2015). Proper Nouns Recognition in Arabic Crime Text Using

Machine Learning Approach, 79(3), 506–513.

Althobaiti, M., Kruschwitz, U., & Poesio, M. (2015). Combining Minimally-supervised

Methods for Arabic Named Entity Recognition. Transactions of the Association for

Computational Linguistics, 3, 243–255. Retrieved from

https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/564

Althobaiti, M., Kruschwitz, U., & Poesio, M. (2013). A Semi-supervised Learning Approach

to Arabic Named Entity Recognition, (September), 32–40.

http://doi.org/10.1177/0165551513502417

Althobaiti, M., Kruschwitz, U., & Poesio, M. (2014). Automatic Creation of Arabic Named

Entity Annotated Corpus Using Wikipedia. Proceedings of the Student Research Workshop at

the 14th Conference of the European Chapter of the Association for Computational

Linguistics, 106–115. Retrieved from http://www.aclweb.org/anthology/E14-3012

Ananiadou, S., & McNaught, J. (2006). Text Mining for Biology and Biomedicine. Boston:

Artech House.

Ananiadou, S., Pyysalo, S., Tsujii, J., & Kell, D. B. (2010). Event extraction for systems

biology by text mining the literature. Trends in Biotechnology.

http://doi.org/10.1016/j.tibtech.2010.04.005

Ando, R. R. K., & Zhang, T. (2005). A high-performance semi-supervised learning method

for text chunking. Proceedings of the 43rd Annual Meeting on Association for Computational

Linguistics, (June), 1–9. http://doi.org/10.3115/1219840.1219841

Baharudin, B., Lee, L. H., & Khan, K. (2010). A Review of Machine Learning Algorithms for

Text-Documents Classification. Journal of Advances in Information Technology, 1(1), 4–20.

http://doi.org/10.4304/jait.1.1.4-20

Bali, R.-M., Chua, C. C., & Ng, P. K. (2007). Identifying and Classifying Unknown Words In

Malay Texts. The Seventh International Symposium on Natural Language Processing

(SNLP2007), 493–498. Retrieved from

http://eprints.usm.my/9442/1/Identifying_and_classifying_unknown_words_in_Malay_texts.p

df%5Cnhttp://eprints.usm.my/9442/

Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open

Information Extraction from the Web. Proceedings of IJCAI-07, the International Joint

Conference on Artificial Intelligence, 2670–2676. http://doi.org/10.1145/1409360.1409378

Bawane, M. S., & Gadicha, P. V. B. (n.d.). Analysing the result of GRIAS framework by

using Precision , Recall and F-measure, 24–30.

Benajiba, Y., Diab, M., & Rosso, P. (2008). Arabic named entity recognition using optimized

feature sets. EMNLP ’08 Proceedings of the Conference on Empirical Methods in Natural

Language Processing, (October), 284–293. Retrieved from

http://dl.acm.org/citation.cfm?id=1613715.1613755

Benajiba, Y., & Rosso, P. (2008). Arabic Named Entity Recognition using Conditional

Random Fields. Proc. of Workshop on HLT & NLP within the Arabic World, LREC. Vol. 8.,

143–153. Retrieved from

http://www.dsic.upv.es/~prosso/resources/BenajibaRosso_LREC08.pdf

Benajiba, Y., Rosso, P., & BenedíRuiz, J. (2007). ANERsys: an Arabic named entity

recognition system based on maximum entropy. Gelbukh, A. (Ed.) CICLing 2007. LNCS,

143–153. Retrieved from http://www.springerlink.com/index/5g6n298843878701.pdf

Bezdek, J. C. (1993). A Physical Interpretation of Fuzzy ISODATA. Readings in Fuzzy Sets

for Intelligent Systems, (November), 615–616. http://doi.org/10.1109/TSMC.1976.4309506

Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M. a, Maynard, D., & Aswani, N.

(2013). TwitIE : An Open-Source Information Extraction Pipeline for Microblog Text. In

Proceedings of Recent Advances in Natural Language Processing (pp. 83–90). Retrieved

from https://www.aclweb.org/anthology/R/R13/R13-1011.pdf

Brief, T. (2005). Agreement , the F-Measure , and Reliability in Information Retrieval, 296–

298. http://doi.org/10.1197/jamia.M1733.Informatics

Brill, E. (2000). Pattern-based disambiguation for natural language processing. Annual

Meeting of the ACL, 1. Retrieved from http://portal.acm.org/citation.cfm?id=1117795

Bsoul, Q., Salim, J., & Zakaria, L. Q. (2013). An Intelligent Document Clustering Approach

to Detect Crime Patterns. Procedia Technology, 11(Iceei), 1181–1187.

http://doi.org/10.1016/j.protcy.2013.12.311

Cao, T. H., Tang, T. M., & Chau, C. K. (2012). Text Clustering with Named Entities: A

Model, Experimentation and Realization. Intelligent Systems Reference Library, 23, 267–287.

http://doi.org/10.1007/978-3-642-23166-7_10

Carlson, A., & Betteridge, J. (2010). Coupled semi-supervised learning for information

extraction. Proceedings of the Third ACM International Conference on Web Search and Data

Mining (2010), 101–110. http://doi.org/10.1145/1718487.1718501

Chapman, C. A. (2016). Usage and refactoring studies of python regular expressions by.

Graduate Theses and Dissertations. This, Paper 1513.

Chapman, C., & Stolee, K. T. (2016). Exploring regular expression usage and context in

Python. In Proceedings of the 25th International Symposium on Software Testing and

Analysis - ISSTA 2016 (pp. 282–293). http://doi.org/10.1145/2931037.2931073

Chart, G., Algorithm, G., Tun, U., & Onn, H. (2012). Single Disciplinary Project Application Form

Fundamental Research Grant Scheme (FRGS), (i), 1–16.

http://doi.org/10.1155/2013/782519.(ISI-Q2).

Che, W., Wang, M., Manning, C. D., & Liu, T. (2013). Named Entity Recognition with

Bilingual Constraints. Proceedings of the 2013 Conference of the North American Chapter of the

Association for Computational Linguistics: Human Language Technologies, (June), 52–

62. Retrieved from http://www.aclweb.org/anthology/N13-1006

Chen, K., Dong, X., Zhu, J., & Shen, B. (2016). Building a domain knowledge base

from wikipedia: A semi-supervised approach. Proceedings of the International Conference on

Software Engineering and Knowledge Engineering, SEKE,

2016–Janua. http://doi.org/10.18293/SEKE2016-051

Chiticariu, L., Krishnamurthy, R., Li, Y., Reiss, F., & Vaithyanathan, S. (2010).

Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks. Proceedings of the

2010 Conference on Empirical Methods in Natural Language Processing, (October), 1002–1012.

Retrieved from http://portal.acm.org/citation.cfm?id=1870756

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P.

(2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research,

12(Aug), 2493–2537. http://doi.org/10.1145/2347736.2347755

Derczynski, L., Maynard, D., Rizzo, G., & Erp, M. Van. (n.d.). Analysis of Named Entity

Recognition and Linking for Tweets, 1–35.

Diab, M. (2009). Second Generation AMIRA Tools for Arabic Processing?: Fast and Robust

Tokenization, POS tagging, and Base Phrase Chunking. Proceedings of the Second

International Conference on Arabic Language Resources and Tools, 285–288. Retrieved from

http://www.elda.org/medar-conference/pdf/56.pdf

Duan, H., Zheng, Y., & Random, C. (2011). A Study on Features of the CRFs-based Chinese.

International Journal of Advanced Intelligence, 3(2), 287–294.

Dumais, S., & Chen, H. (2000). Hierarchical classification of Web content. SIGIR ’00:

Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and

Development in Information Retrieval, 256–263. http://doi.org/10.1145/345508.345593

Ek, T., Kirkegaard, C., Jonsson, H., & Nugues, P. (2011). Named entity recognition for short text

messages. Procedia - Social and Behavioral Sciences, 27(September), 178–187.

http://doi.org/10.1016/j.sbspro.2011.10.596

Ekbal, A., & Saha, S. (2011). A multiobjective simulated annealing approach for classifier

ensemble: Named entity recognition in Indian languages as case studies. Expert Systems with

Applications, 38(12), 14760–14772. http://doi.org/10.1016/j.eswa.2011.05.004

Ekbal, A., Saha, S., & Sikdar, U. K. (2012). Multiobjective Optimization for Biomedical

Named Entity Recognition and Classification. Procedia Technology, 6(0), 206–213.

http://doi.org/http://dx.doi.org/10.1016/j.protcy.2012.10.025

Elsayed, H., & Elghazaly, T. (2015). A Named Entities Recognition System for Modern

Standard Arabic using Rule-Based Approach. 2015 First International Conference on Arabic

Computational Linguistics (ACLing), 12(1), 51–54. http://doi.org/10.1109/ACLing.2015.14

Elsebai, a, Meziane, F., & Belkredim, F. (2009). A Rule Based Persons Names Arabic

Extraction System. Communications of the IBIMA, 11(August), 53–59. Retrieved from

http://usir.salford.ac.uk/2206/

Elyasir, A. M. H., Sonai, K., & Anbananthen, M. (2013). Comparison between Bag of Words

and Word Sense Disambiguation, (Icacsei), 413–417.

Etzioni, O., Cafarella, M., Downey, D., Popescu, A. M., Shaked, T., Soderland, S.,… Yates,

A. (2005). Unsupervised named-entity extraction from the Web: An experimental study.

Artificial Intelligence, 165(1), 91–134. http://doi.org/10.1016/j.artint.2005.03.001

Fadzli, S. A., Norsalehen, A. K., Syarilla, I. A., Hasni, H., & Dhalila, M. S. S. (2012). Simple

rules malay stemmer. The International Conference on Informatics and Applications

(ICIA2012), 28–35. Retrieved from http://sdiwc.net/digitallibrary/

download.php?id=00000187.pdf

Fuchs, G., Stange, H., Samiei, A., Andrienko, G., & Andrienko, N. (2015). A semi-supervised

method for topic extraction from micro postings. Information Technology, 57(1), 49–56.

http://doi.org/10.1515/itit-2014-1078

Fung, P., Fung, P., Cheung, P., & Cheung, P. (2004). Mining Very-Non-Parallel Corpora:

Parallel Sentence and Lexicon Extraction via Bootstrapping and EM. EMNLP 2004 -

Conference on Empirical Methods in Natural Language Processing, 57–63. Retrieved from

http://www.aclweb.org/anthology-new/W/W04/W04-3208.pdf

Gosselin, L., Tye-Gingras, M., & Mathieu-Potvin, F. (2009). Review of utilization of genetic

algorithms in heat transfer problems. International Journal of Heat and Mass Transfer.

Elsevier Ltd. http://doi.org/10.1016/j.ijheatmasstransfer.2008.11.015

Goyvaerts, J., & Levithan, S. (2012). Regular Expressions Cookbook, 612.

http://doi.org/9780596802837

Gunawan, Purnama, I. K. E., & Hariadi, M. (2015). Supervised learning Indonesian gloss

acquisition. IAENG International Journal of Computer Science, 42(4), 337–346.

Hassan, M., Nazlia, O., & Mohd Juzaiddin, A. A. (2015). Malay Part of Speech Tagger : A

Comparative Study on Tagging Tools. Asia-Pacific Journal of Information Technology and

Multimedia, 4(1), 11–23. http://doi.org/10.17576/apjitm-2015-0401-02

Hemmati, M., Amjady, N., & Ehsan, M. (2014). System modeling and optimization for

islanded micro-grid using multi-cross learning-based chaotic differential evolution algorithm.

International Journal of Electrical Power and Energy Systems, 56, 349–360.

http://doi.org/10.1016/j.ijepes.2013.11.015

Heydt, M. (2015). Learning pandas: Get to grips with pandas - a versatile and highperformance

Python library for data manipulation, analysis, and discovery. Retrieved from

http://gen.lib.rus.ec/book/index.php?md5=75566423DC8A5A9411165F24EF9DD886

Hu, B., Tang, B., Chen, Q., & Kang, L. (2016). A novel word embedding learning model

using the dissociation between nouns and verbs. Neurocomputing, 171, 1108–1117.

http://doi.org/10.1016/j.neucom.2015.07.046

Isa, N., Puteh, M., & Kamarudin, R. M. H. R. (2013). Sentiment classification of malay

newspaper using immune network (SCIN). Lecture Notes in Engineering and Computer

Science, 3 LNECS, 1543–1548. Retrieved from

http://www.scopus.com/inward/record.url?eid=2-s2.0-

84887882006&partnerID=40&md5=652fdc713458c4dfedcbc4e3a0b736b6

J.M., M. M. U. J. S.-C. S. M. J. G.-B. (2013). Named Entity Recognition: Fallacies challenges

and opportunities. Computer Standards and Interfaces,

3554824891(http://www.scopus.com/inward/record.url?eid=2-s2.0-

84878302542&partnerID=40&md5=fa0cc4fcfad6db514533c129e08333d6).

Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters,

31(8), 651–666. http://doi.org/10.1016/j.patrec.2009.09.011

Kanagavalli, R. V, & K, R. (2013). Detecting and resolving spatial ambiguity in text using

named entity extraction and Self-Learning fuzzy logic techniques. Retrieved from

http://arxiv.org/abs/1303.0445

Kantardzic, M. (2011). Data Mining: Concepts, Models, Method, and Algorithms (2nd

Edition) (2nd ed.). New Jersey: John Wiley & Sons, Inc.

Khalaf, Z. (2015). MAHIR System: Unsupervised Segmentation for Malay Spoken Broadcast

News Stories. International Journal of Information and Electronics Engineering, 5(3).

http://doi.org/10.7763/IJIEE.2015.V5.532

Kondrak, S. B. and G. (2007). Alignment-Based Discriminative String Similarity.

Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics,

656–663.

Kraft, D. H., Martin-Bautista, M. J., Chen, J., & Sanchez, D. (2003). Rules and fuzzy rules in

text: Concept, extraction and usage. International Journal of Approximate Reasoning, 34(2–

3), 145–161. http://doi.org/10.1016/j.ijar.2003.07.005

Král, P. (2014). Named entities as new features for Czech document classification. Lecture

Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and

Lecture Notes in Bioinformatics), 8404 LNCS (PART 2), 417–427.

http://doi.org/10.1007/978-3-642-54903-8_35

Kummerfeld, J., & Curran, J. (2008). Classification of Verb-Particle Constructions with the

Google Web1T Corpus. Australasian Language Technology Association Workshop 2008, 6

(December), 55–63. Retrieved from http://aclweb.org/anthology-new/U/U08/U08-

1.pdf#page=114

Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields:

Probabilistic models for segmenting and labeling sequence data. ICML ’01 Proceedings of the

Eighteenth International Conference on Machine Learning, 8(June), 282–289.

http://doi.org/10.1038/nprot.2006.61

Larasati, S. (2012). Towards an Indonesian-English {SMT} System: A Case Study of an

Under-Studied and Under-Resourced Language, Indonesian. {WDS}’12 Proceedings of

Contributed Papers, 123–129.

Le Nguyen, M., & Shimazu, A. (2014). A semi supervised learning model for mapping

sentences to logical forms with ambiguous supervision. In Data and Knowledge Engineering

(Vol. 90, pp. 1–12). Elsevier B.V. http://doi.org/10.1016/j.datak.2013.12.001

Le, T., Nguyen, K., Nguyen, V., Nguyen, V., & Phung, D. (2016). Scalable Support Vector

Machine for Semi-supervised Learning, 1–18. Retrieved from http://arxiv.org/abs/1606.06793

Li, Y., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., Arbor, A., & Jagadish, H. V.

(2008). Regular Expression Learning for Information Extraction. Conference on Empirical

Methods in Natural Language Processing, (October), 21–30. Retrieved from

http://portal.acm.org/citation.cfm?id=1613719

Liao, W., & Veeramachaneni, S. (2009). A simple semi-supervised algorithm for named

entity recognition. Workshop on Semi-Supervised Learning for Natural Language Processing,

(June), 58–65. http://doi.org/10.3115/1621829.1621837

Liu, X., Zhang, S., Wei, F., & Zhou, M. (2011). Recognizing Named Entities in Tweets. In

Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

(ACL), 1(2008), 359–367. Retrieved from http://acl.eldoc.ub.rug.nl/mirror/P/P11/P11-

1037.pdf

Lu, Y., Ji, D., Yao, X., Wei, X., & Liang, X. (2015). CHEMDNER system with mixed

conditional random fields and multi-scale word clustering. Journal of Cheminformatics,

7(Suppl 1), S4. http://doi.org/10.1186/1758-2946-7-S1-S4

Luis Eduardo, P., Iacobelli, F., & Su, S. (2015). Semi-Supervised Approach to Named Entity

Recognition in Spanish Applied to a Real-World Conversational System, 224–235.

http://doi.org/10.1007/978-3-319-19264-2

Luo, W., & Yang, F. (2016). An Empirical Study of Automatic Chinese Word Segmentation

for Spoken Language Understanding and Named Entity Recognition, 238–248.

Malanyon, D. (2009). Malay Lexical Analysis through Corpus-Based Approach.

Eprints.Usm.My. Retrieved from http://eprints.usm.my/10608/

Mangasi, T., Erwin, A., & Ipung, H. P. (2014). Defined entity extraction based on Indonesian

text document. In Proceedings - 2014 International Conference on ICT for Smart Society:

“Smart System Platform Development for City and Society, GoeSmart 2014”, ICISS 2014 (pp.

61–65). http://doi.org/10.1109/ICTSS.2014.7013152

Manning, C. D., & Raghavan, P. (2009). An Introduction to Information Retrieval. Online, 1,

1. http://doi.org/10.1109/LPT.2009.2020494

Markov, Z., & Larose, D. T. (2007). Data Mining the Web: Uncovering Patterns in Web

Content, Structure, and Usage. John Wiley & Sons, Inc.

Mikolov, T., Le, Q. V, & Sutskever, I. (2013). Exploiting Similarities among Languages for

Machine Translation. arXiv Preprint arXiv:1309.4168v1, 1–10. Retrieved from

http://arxiv.org/abs/1309.4168v1%5Cnhttp://arxiv.org/abs/1309.4168

Miner, G., Elder, J., Fast, A., Hill, T., Nisbet, R., & Delen, D. (2012). Practical Text Mining

and Statistical Analysis for Non-structured Text Data Applications, 1st ed. Elsevier.

Oklahoma: Academic Press. http://doi.org/10.1016/B978-0-12-386979-1.00009-8

Mohamed, H., Omar, N., & Ab. Aziz, M. J. (2015). Malay Part of Speech Tagger: A

Comparative Study on Tagging Tools. Asia-Pacific Journal of Information Technology and

Multimedia, 4(1), 11–23. http://doi.org/10.17576/apjitm-2015-0401-02

Mohd Don, Z. (2010). Processing natural malay texts: A data-driven approach. Trames, 14(1),

90–103. http://doi.org/10.3176/tr.2010.1.06

Mohit, B., Schneider, N., Bhowmick, R., Oflazer, K., & Smith, N. a. (2012). Recall-oriented

learning of named entities in Arabic Wikipedia. Proceedings of the 13th Conference of the

European Chapter of the Association for Computational Linguistics, 162–173. Retrieved

from http://dl.acm.org/citation.cfm?id=2380816.2380839

Nadeau, D. (2007). A survey of named entity recognition and classification. Linguisticae

Investigationes, 8(30), 3–26. http://doi.org/10.1075/li.30.1.03nad

Nogueira, T. M., Rezende, S. O., & Camargo, H. a. (2010). On the use of fuzzy rules to text

document classification. Hybrid Intelligent Systems (HIS), 2010 10th International

Conference on, 19–24. http://doi.org/10.1109/HIS.2010.5600076

Noh, N., Rusydi, M., Talib, A., Ahmad, A., Halim, S. A., & Mohamed, A. (2009). Malay

Language Document Identification Using BPNN. In Proceedings of the 10th WSEAS

international conference on Neural networks (pp. 163–168).

Nothman, J., Ringland, N., Radford, W., Murphy, T., & Curran, J. R. (2013). Learning

multilingual named entity recognition from Wikipedia. Sydney: Elsevier Science.

http://doi.org/10.1016/j.artint.2012.03.006

Ojo, A., & Adeyemo, A. B. (2012). Framework for Knowledge Discovery from Journal

Articles Using Text Mining Techniques. African Journal of Computing & ICT, 5(2), 35–44.

Retrieved from http://www.ajocict.net/uploads/Pre-print_-

_O__Ojo___A_B__Adeyemo__2012___Framework_for_Knowledge_Discovery_from_Journ

al_Articles_Using_Text_Mining_Techniques.pdf

Oudah, M., & Shaalan, K. (2012). A Pipeline Arabic Named Entity Recognition using a

Hybrid Approach. COLING (December 2012), 2159–2176. Retrieved from

http://www.newdesign.aclweb.org/anthology/C/C12/C12-1132.pdf

Oudah, M., & Shaalan, K. (2016). Studying the impact of language-independent and

language-specific features on hybrid Arabic Person name recognition. Language Resources

and Evaluation, 1–28. http://doi.org/10.1007/s10579-016-9376-1

Petrov, S., Das, D., & McDonald, R. (2011). A Universal Part-of-Speech Tagset. Retrieved

from http://arxiv.org/abs/1104.2086

Pham, Q. H., Nguyen, M.-L., Nguyen, B. T., & Cuong, N. V. (2015). Semi-supervised

Learning for Vietnamese Named Entity Recognition using Online Conditional Random Fields.

In Proceedings of the Fifth Named Entity Workshop (pp. 50–55). Retrieved from

http://www.aclweb.org/anthology/W15-3907

POWERS, D.M.W. (AILab, School of Computer Science, Engineering and Mathematics,

Flinders University, South Australia, A. (2011). Evaluation: From Precision, Recall and FMeasure

To Roc, Informedness, Markedness & Correlation. Journal of Machine Learning

Technologies, 2(1), 37–63. http://doi.org/10.1.1.214.9232

Powers, D. M. W. (2015). What the F-measure doesn’t measure: Features, Flaws, Fallacies

and Fixes, 19. http://doi.org/KIT-14-001

Prasad, G., Fousiya, K. K., Kumar, M. A., & Soman, K. P. (2015). Named Entity Recognition

for Malayalam Language : A CRF based Approach, (May), 16–19.

Ramli, I., Jamil, N., Seman, N., & Ardi, N. (2015). An Improved Syllabification for a Better

Malay Language Text-to-Speech Synthesis (TTS). 2015 IEEE International Symposium On

Robotics and Intelligent Sensors, 76 (Iris), 417–424.

http://doi.org/10.1016/j.procs.2015.12.280

Rao, R. V., & Saroj, A. (2017). A self-adaptive multi-population based Jaya algorithm for

engineering optimization. Swarm and Evolutionary Computation, (October 2016), 1–26.

http://doi.org/10.1016/j.swevo.2017.04.008

Ritter, A., Clark, S., Mausam, & Etzioni, O. (2011). Named Entity Recognition in Tweets: An

Experimental Study. Proceedings of the 2011 Conference on Empirical Methods in Natural

Language Processing, 1524–1534. Retrieved from http://dl.acm.org/citation.cfm?id=2145595

Rosso, P., Benajiba, Y., & Lyhyaoui, A. (2006, December). Towards an Arabic question

answering system. In Proc. 4th Conf. on Scientific Research Outlook & Technology

Development in the Arab world, SROIV, Damascus, Syria (pp. 11-14).

Rozenfeld, B., & Feldman, R. (2008). Self-supervised relation extraction from the Web.

Knowledge and Information Systems, 17(1), 17–33. http://doi.org/10.1007/s10115-007-0110-

Sam, R. C., Le, H. T., Nguyen, T. T., & Nguyen, T. H. (2011). Combining proper namecoreference

with conditional random fields for semi-supervised named entity recognition in

Vietnamese text. Lecture Notes in Computer Science (Including Subseries Lecture Notes in

Artificial Intelligence and Lecture Notes in Bioinformatics), 6634 LNAI (PART 1), 512–524.

http://doi.org/10.1007/978-3-642-20841-6-42

Samat, N. A., Murad, M. A. A., Abdullah, M. T., & Atan, R. (2005). Malay Documents

Clustering Algorithm Based on Singular Value Decomposition. Journal of Theoretical and

Applied Information Technology, 180–186.

Sari, Y., Hassan, M. F., & Zamin, N. (2009). A Hybrid Approach to Semi-supervised Named

Entity Recognition in Health, Safety and Environment Reports. 2009 International

Conference on Future Computer and Communication, 599–602.

http://doi.org/10.1109/ICFCC.2009.52

Sari, Y., Hassan, M. F., & Zamin, N. (2010). Rule-based pattern extractor and Named Entity

Recognition: A hybrid approach. In Proceedings 2010 International Symposium on

Information Technology - Engineering Technology, ITSim’10 (Vol. 2, pp. 563–568).

http://doi.org/10.1109/ITSIM.2010.5561392

Satoshi Sekine, K. S., & Nobata, C. (2002). Extended named entity hierarchy. Third

International Conference on Language Resources and Evaluation (LREC 2002), 1818–1824.

Sazali, S. S., Rahman, N. A., & Bakar, Z. A. (2017). Information extraction: Evaluating

named entity recognition from classical Malay documents. In 2016 3rd International

Conference on Information Retrieval and Knowledge Management, CAMP 2016 - Conference

Proceedings (pp. 48–53). http://doi.org/10.1109/INFRKM.2016.7806333

Seeger, M., & King, I. (2002). Learning from labeled and unlabeled data. Learning, (January),

1–62. http://doi.org/10.1109/IJCNN.2002.1007592

Sekine, S., Sudo, K., & Nobata, C. (2002, May). Extended Named Entity Hierarchy. In LREC.

Selvaperumal, P., & Suruliandi, A. (2016). Semi-Supervised Personal Name Disambiguation

Technique for the Web. International Journal of Modern Education and Computer Science,

8(3), 28–36. http://doi.org/10.5815/ijmecs.2016.03.04

Servan, C., Berard, A., Elloumi, Z., Blanchon, H., & Besacier, L. (2016). Word2Vec vs

DBnary: Augmenting METEOR using Vector Representations or Lexical Resources?

Retrieved from http://arxiv.org/abs/1610.01291

Shaalan, K., & Oudah, M. (2013). A hybrid approach to Arabic named entity recognition.

Journal of Information Science, 40(1), 67–87. http://doi.org/10.1177/0165551513502417

Shaalan, K., & Raza, H. (2007). Person Name Entity Recognition for Arabic. Computational

Linguistics, (June), 17–24. http://doi.org/10.3115/1654576.1654581

Shabat, H. (2015). Named Entity Recognition in Crime News Documents Using Classifiers

Combination, 23(6), 1215–1222. http://doi.org/10.5829/idosi.mejsr.2015.23.06.22271

Sharma, D., Devale, P. R., & Khare, A. K. (2011). Approach for Multiword Expression

Identification in Natural Language Processing, 2 (August 2011), 663–666.

Sidi. (2011). Malay Interrogative Knowledge Corpus. American Journal of Economics and

Business Administration, 3, 171–176. http://doi.org/10.3844/ajebasp.2011.171.176

Sinoara, R. A., Sundermann, C. V., Marcacini, R. M., Domingues, M. A., & Rezende, S. O.

(2014). Named entities as privileged information for hierarchical text clustering. Proceedings

of the 18th International Database Engineering & Applications Symposium on - IDEAS ’14,

57–66. http://doi.org/10.1145/2628194.2628225

Srivastava, A. N., & Sahami, M. (2009). Text Mining: Classification, Clustering, and

Applications. Boca Raton: Chapman and Hall/CRC.

Suakkaphong, N., Zhang, Z., & Chen, H. (2013). Disease Named Entity Recognition Using

Semisupervised Learning and Conditional Random Fields. Journal of the American Society

for Information Science and Technology, 14(4), 90–103. http://doi.org/10.1002/asi

Sun, a, Grishman, R., & Sekine, S. (2011). Semi-supervised relation extraction with largescale

word clustering. Proceedings of the 49th Annual Meeting …, 521–529. Retrieved from

http://www.aaai.org/Papers/AAAI/2007/AAAI07-

224.pdf%5Cnhttp://dl.acm.org/citation.cfm?id=2002539

Suwarningsih, W., Supriana, I., & Purwarianti, A. (2015). ImNER Indonesian medical named

entity recognition. In Proceedings of 2014 2nd International Conference on Technology,

Informatics, Management, Engineering and Environment, TIME-E 2014 (pp. 184–188).

http://doi.org/10.1109/TIME-E.2014.7011615

Tabuchi, N., Sumii, E., & Yonezawa, A. (2003). Regular expression types for strings in a text

processing language. Electronic Notes in Theoretical Computer Science, 75, 97–115.

http://doi.org/10.1016/S1571-0661(04)80781-3

Tan, T. P., Xiao, X., Tang, E. K., Chng, E. S., & Li, H. (2009). MASS: A Malay language

LVCSR corpus resource. 2009 Oriental COCOSDA International Conference on Speech

Database and Assessments, ICSDA 2009, 25–30.

http://doi.org/10.1109/ICSDA.2009.5278382

Tran, V. C., Hwang, D., & Jung, J. J. (2015). Semi-supervised Approach Based on Co-occurrence

Coefficient for Named Entity Recognition on Twitter, 141–146.

Triguero, I., García, S., & Herrera, F. (2013). Self-labeled techniques for semi-supervised

learning: taxonomy, software and empirical study. Knowledge and Information Systems, pp.

1–40. http://doi.org/10.1007/s10115-013-0706-y

Triguero, I., Sáez, J. A., Luengo, J., García, S., & Herrera, F. (2014). On the characterization

of noise filters for self-training semi-supervised in nearest neighbor classification.

Neurocomputing, 132, 30–41. http://doi.org/10.1016/j.neucom.2013.05.055

Trstenjak, B., Mikac, S., & Donko, D. (2014). KNN with TF-IDF based framework for text

categorization. In Procedia Engineering (Vol. 69, pp. 1356–1364). Elsevier B.V.

http://doi.org/10.1016/j.proeng.2014.03.129

Tuffery, S. (2011). Data Mining and Statistics for Decision Making. Wiley.

Turian, J., Ratinov, L., & Bengio, Y. (2010). Word Representations: A Simple and

General Method for Semi-supervised Learning. Proceedings of the 48th Annual Meeting of

the Association for Computational Linguistics, (July), 384–394.

http://doi.org/10.1.1.301.5840

Wibawa, A. S., & Purwarianti, A. (2016). Indonesian Named-entity Recognition for 15

Classes Using Ensemble Supervised Learning. Procedia Computer Science, 81(May), 221–228.

http://doi.org/10.1016/j.procs.2016.04.053

Witten, I. H., Frank, E., & Hall, M. (2011). Data Mining: Practical Machine Learning Tools

and Techniques (2nd ed.). http://doi.org/citeulike-article-id:8827086

Worden, K., Staszewski, W. J., & Hensman, J. J. (2011). Natural computing for mechanical

systems research: A tutorial overview. Mechanical Systems and Signal Processing. Elsevier.

http://doi.org/10.1016/j.ymssp.2010.07.013

Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., ... Steinberg, D. (2008). Top

10 algorithms in data mining. Knowledge and Information Systems (Vol. 14).

http://doi.org/10.1007/s10115-007-0114-2

Xian, B. C. M., Lubani, M., Ping, L. K., Bouzekri, K., Mahmud, R., & Lukose, D. (2016).

Benchmarking Mi-POS: Malay Part-of-Speech Tagger. International Journal of Knowledge

Engineering, 2(3), 115–121. http://doi.org/10.18178/ijke.2016.2.3.064

Yang, F., & Vozila, P. (2014). Semi-Supervised Chinese Word Segmentation Using Partial-Label

Learning With Conditional Random Fields. EMNLP, 90–98. Retrieved from

http://emnlp2014.org/papers/pdf/EMNLP2014010.pdf

Yesilbudak, M., Sagiroglu, S., & Colak, I. (2017). A novel implementation of kNN classifier

based on multi-tupled meteorological input data for wind power prediction. Energy

Conversion and Management, 135, 434–444. http://doi.org/10.1016/j.enconman.2016.12.094

Yong, S.-F., Ranaivo-Malançon, B., & Wee, A. Y. (2011). NERSIL: The named-entity

recognition system for Iban language. 25th Pacific Asia Conference on Language,

Information and Computation, 549–558.

Yong, Z., Youwen, L., & Shixiong, X. (2009). An Improved KNN Text Classification

Algorithm Based on Clustering. Journal of Computers, 4(3), 230–237.

http://doi.org/10.4304/jcp.4.3.230-237

Zamin, N., & Oxley, A. (2011). Building a Corpus-Derived Gazetteer for Named Entity

Recognition, 73–80.

Zamin, N., Oxley, A., Abu Bakar, Z., & Farhan, S. A. (2012). A statistical dictionary-based

word alignment algorithm: An unsupervised approach. In 2012 International Conference on

Computer and Information Science, ICCIS 2012 - A Conference of World Engineering,

Science and Technology Congress, ESTCON 2012 - Conference Proceedings (Vol. 1, pp.

396–402). http://doi.org/10.1109/ICCISci.2012.6297278

Zatarain Salazar, J., Reed, P. M., Herman, J. D., Giuliani, M., & Castelletti, A. (2016). A

diagnostic assessment of evolutionary algorithms for multi-objective surface water reservoir

control. Advances in Water Resources, 92, 172–185.

http://doi.org/10.1016/j.advwatres.2016.04.006

Zeng, H., Song, A., & Cheung, Y. M. (2013). Improving clustering with pairwise constraints:

A discriminative approach. Knowledge and Information Systems, 36(2), 489–515.

http://doi.org/10.1007/s10115-012-0592-8

Zhan, Q. (2017). An Improved K-means Algorithm Based on Structure Features, 12(1), 62–80.

http://doi.org/10.17706/jsw.12.1.62-81

Zhang, C., Hong, X., & Peng, Z. (2012). An automatic approach to harvesting temporal

knowledge of entity relationships. In Procedia Engineering (Vol. 29, pp. 1399–1409).

http://doi.org/10.1016/j.proeng.2012.01.147

Zhang, S., & Elhadad, N. (2013). Unsupervised biomedical named entity recognition:

Experiments with clinical and biological texts. Journal of Biomedical Informatics, 46(6),

1088–1098. http://doi.org/10.1016/j.jbi.2013.08.004

Zhou, D., & Zhong, D. (2015). A semi-supervised learning framework for biomedical event

extraction based on hidden topics. Artificial Intelligence in Medicine, 64(1), 51–58.

http://doi.org/10.1016/j.artmed.2015.03.004

Zirikly, A., & Diab, M. (2015). Named Entity Recognition for Arabic Social Media.

Proceedings of NAACL-HLT 2015, 176–185. Retrieved from

http://www.aclweb.org/anthology/W15-1524.pdf

