Building controlled bilingual terminologies for the municipal domain and evaluating them using a coverage estimation
approach
This article examines the status of the controlled terminologies we constructed, in terms of how well they cover terms and concepts. To
facilitate controlled authoring of Japanese texts in the municipal domain and promote their machine translatability into English, we
constructed the terminologies in two steps: (1) we extracted Japanese-English term pairs from aligned texts; (2) we controlled term
variation by defining preferred and proscribed terms for both languages. To assess the coverage of the constructed
terminologies, we propose a quantitative extrapolation method that estimates the potential vocabulary size. The coverage
estimates show that the coverage of terms is about 10% higher for Japanese than for English, which
reflects the greater diversity of the translated English terms. The coverage of concepts reaches around 60% for
both Japanese and English. The method also enables us to estimate quantitatively how much effort is needed to increase the
coverage further.
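The idea of estimating coverage against a potential vocabulary size can be illustrated with a minimal sketch. It rests on assumptions: this abstract does not specify the extrapolation model, so the sketch substitutes the Chao1 species-richness estimator as a simple stand-in, and the function names and toy data are hypothetical. Coverage is taken as the number of distinct terms observed divided by the estimated total number of distinct terms in the domain.

```python
from collections import Counter


def chao1_population_size(term_occurrences):
    """Estimate the total number of distinct terms (seen and unseen).

    Uses the bias-corrected Chao1 estimator, a stand-in chosen for
    illustration; it is NOT the extrapolation method of the article.
    """
    freqs = Counter(term_occurrences)          # term -> number of occurrences
    observed = len(freqs)                      # distinct terms seen so far
    spectrum = Counter(freqs.values())         # occurrence count -> number of terms
    f1 = spectrum.get(1, 0)                    # terms seen exactly once
    f2 = spectrum.get(2, 0)                    # terms seen exactly twice
    # The bias-corrected form stays defined even when f2 == 0.
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


def estimated_coverage(term_occurrences):
    """Ratio of observed distinct terms to the estimated population size."""
    observed = len(set(term_occurrences))
    if observed == 0:
        return 0.0
    return observed / chao1_population_size(term_occurrences)


if __name__ == "__main__":
    # Hypothetical toy sample of extracted term occurrences.
    sample = ["tax", "tax", "resident", "resident", "permit", "pension"]
    print(f"estimated population size: {chao1_population_size(sample):.2f}")
    print(f"estimated coverage: {estimated_coverage(sample):.2%}")
```

Many terms seen only once push the estimated population size up and the coverage down, which is the intuition behind reading coverage as a measure of how much collection effort remains.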
Article outline
- 1. Introduction
- 2. Related work
- 2.1 Term extraction
- 2.2 Term variation management
- 2.3 Terminology evaluation
- 3. Building controlled bilingual terminology
- 3.1 Parallel corpus compilation
- 3.2 Term collection
- 3.2.1 Terms to be collected
- 3.2.2 Term extraction platform and procedure
- 3.2.3 Extracted terms
- 3.3 Typology of term variation
- 3.4 Terminology control
- 4. The coverage estimation approach to evaluate constructed terminologies
- 4.1 Self-referring coverage estimation
- 4.2 Expected number of terms
- 4.3 Growth rate of terms
- 4.4 Conditions to be observed
- 5. Results and discussions
- 5.1 Population types and present status of terminologies
- 5.2 Growth patterns of terminology
- 5.3 Examination of models with different data points
- 5.4 Use of lexical items for the coverage estimation of terminology
- 6. Conclusions and future work
- Acknowledgements
- Notes
- References