An automatic bilingual corpora generator

Siti Nordianah Hai Hom

Type :	Article
Subject :	QA Mathematics
ISSN :	0127-9750
Main Author :	Siti Nordianah Hai Hom
Additional Authors :	Azniah Ismail
Title :	An automatic bilingual corpora generator

Place of Production :	Tanjong Malim
Publisher :	Fakulti Komputeran dan META-Teknologi
Year of Publication :	2014
Notes :	Vol. 1 (2014): Journal of ICT in Education (JICTIE)
Corporate Name :	Perpustakaan Tuanku Bainun

Abstract :

Bilingual corpora, which contain similar documents in two different languages, are essential resources for Natural Language Processing (NLP) tasks such as Cross-Lingual Information Retrieval (CLIR) and machine translation. These resources can also be useful for many processes in language learning. We introduce an automatic bilingual corpora generator that builds corpus resources from the web. The generator uses in-domain terms (IDT), which can be thought of as the most important contextually relevant words. The method is simple yet practical, and makes acquiring resources from web sources more than just collecting texts and pasting them together. However, as this is an ongoing project, the system has not yet been fully implemented and evaluated. This paper therefore focuses on the prototype of the system in terms of appearance and display. For example, the generator is to be built as a web-based system that gives users different options for how they would like to view the acquired texts.

Keywords : bilingual corpora, in-domain term (IDT)
