Corpus-based bilingual terminology extraction in the power engineering domain
This paper presents the resources and tools used to extract and evaluate bilingual English-Serbian terminology in
the power engineering domain. The resources consist of existing general and domain lexica and a domain parallel corpus; the tools
include term extractors for both languages and a tool for aligning segments (chunks) within corpus sentences. The system was
tested by varying the match function that establishes whether an extracted term is present in an aligned chunk, from very loose
to strict. The evaluation showed a precision of 92% for English term extraction, 86% for Serbian term extraction and, under the
strictest match function, 72% for bilingual pair extraction. The extraction yielded 2,684 correct bilingual pairs, which enriched
the terminology database and can also support search over the aligned power engineering collection stored in a digital library.
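
The idea of a match function that ranges from loose to strict can be illustrated with a minimal sketch. The abstract does not define the match functions actually used, so the three levels below, and the names loose_match, bag_match, strict_match and candidate_pairs, are hypothetical and serve only as an illustration; the sample terms and chunks are invented as well.

```python
import re


def tokens(text):
    """Lowercase text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def loose_match(term, chunk):
    """Loosest level (assumed): at least one token of the term occurs in the chunk."""
    chunk_tokens = set(tokens(chunk))
    return any(tok in chunk_tokens for tok in tokens(term))


def bag_match(term, chunk):
    """Middle level (assumed): every token of the term occurs in the chunk, in any order."""
    chunk_tokens = set(tokens(chunk))
    term_tokens = tokens(term)
    return bool(term_tokens) and all(tok in chunk_tokens for tok in term_tokens)


def strict_match(term, chunk):
    """Strictest level (assumed): the term occurs in the chunk as one contiguous token sequence."""
    term_tokens = tokens(term)
    chunk_tokens = tokens(chunk)
    n = len(term_tokens)
    if n == 0:
        return False
    return any(chunk_tokens[i:i + n] == term_tokens
               for i in range(len(chunk_tokens) - n + 1))


def candidate_pairs(en_terms, sr_terms, aligned_chunks, match):
    """Pair an English and a Serbian term when each matches its side of the same aligned chunk."""
    pairs = set()
    for en_chunk, sr_chunk in aligned_chunks:
        for en in en_terms:
            if not match(en, en_chunk):
                continue
            for sr in sr_terms:
                if match(sr, sr_chunk):
                    pairs.add((en, sr))
    return pairs


# Invented sample data, for illustration only.
aligned_chunks = [
    ("the power system", "elektroenergetski sistem"),
    ("system of power lines", "sistem dalekovoda"),
]
en_terms = ["power system"]
sr_terms = ["elektroenergetski sistem", "dalekovod"]

strict_pairs = candidate_pairs(en_terms, sr_terms, aligned_chunks, strict_match)
loose_pairs = candidate_pairs(en_terms, sr_terms, aligned_chunks, loose_match)
```

In this sketch a stricter function accepts fewer chunks, which trades recall for precision. The comparison works on surface forms only; for an inflected language such as Serbian, a realistic matcher would presumably compare lemmas instead.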
Article outline
- 1. Introduction – motivation
- 2. Terminology resources for the power engineering domain
- 3. The work related to bilingual terminology extraction
- 4. Bilingual terminology compilation
- 4.1 PE bilingual parallel corpus
- 4.2 Terminology extraction for Serbian
- 4.3 Terminology extraction for English
- 4.4 Alignment of chunks
- 4.5 Matching terms and chunks
- 5. Discussion
- 5.1 Evaluation of results
- 5.2 Performance of Serbian and English extractors
- 5.3 Publishing of results
- 6. Conclusion
- Notes
- References
- Dictionaries