Word segmentation granularity in Korean

Park, Jungyeul; Kim, Mija

doi:10.1075/kl.00008.par

Article published in:

Korean Linguistics
Vol. 20:1 (2024), pp. 82–112

Jungyeul Park | The University of British Columbia

Mija Kim | Kyung Hee University

This paper describes word segmentation granularity in Korean language processing. Between a word separated by blank space, termed an eojeol, and a sequence of individual morphemes, Korean allows multiple levels of word segmentation granularity. Several granularity levels have been proposed and used for specific language processing and corpus annotation tasks, because agglutinative languages, including Korean, have a one-to-one mapping between functional morphemes and syntactic categories. We therefore analyze these granularity levels and present examples from Korean language processing systems for future reference. Interestingly, the granularity that separates only functional morphemes, including case markers and verbal endings, while keeping other suffixes for morphological derivation attached, yields the best performance for phrase structure parsing. This contradicts the previous best practice of separating all morphemes, which has been the de facto standard for various Korean language processing applications.
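As a rough illustration of what different granularity levels mean in practice, the sketch below segments one toy sentence at five levels, from whole eojeols down to individual morphemes. Everything in it is an assumption made for illustration: the example sentence, the category labels (`stem`, `deriv`, `case`, `ending`, `symbol`), and the rule that each level splits off one more category are hypothetical and are not taken from the paper's annotation scheme or from any existing tool.

```python
from typing import Dict, List, Set, Tuple

# A morpheme is a (surface form, category) pair. The category labels are
# hypothetical and exist only for this sketch.
Morpheme = Tuple[str, str]

# Categories that become separate tokens at each level (cumulative).
# Assumption: each level splits off one more class of morphemes.
SPLIT_AT_LEVEL: Dict[int, Set[str]] = {
    1: set(),                                  # whole eojeols
    2: {"symbol"},                             # words and symbols
    3: {"symbol", "case"},                     # + case markers
    4: {"symbol", "case", "ending"},           # + verbal endings
    5: {"symbol", "case", "ending", "deriv"},  # + derivational suffixes
}


def segment(eojeol: List[Morpheme], level: int) -> List[str]:
    """Merge the morphemes of one eojeol into tokens for the given level."""
    split = SPLIT_AT_LEVEL[level]
    tokens: List[str] = []
    can_attach = False  # may the current morpheme attach to the last token?
    for form, category in eojeol:
        if category in split or not can_attach:
            tokens.append(form)   # start a new token
        else:
            tokens[-1] += form    # keep it attached to the preceding token
        can_attach = category not in split
    return tokens


def segment_sentence(sentence: List[List[Morpheme]], level: int) -> List[str]:
    """Segment every eojeol of a sentence and flatten the result."""
    return [tok for eojeol in sentence for tok in segment(eojeol, level)]


# Toy sentence, hand-analyzed for this sketch: "The students ate rice."
SENTENCE: List[List[Morpheme]] = [
    [("학생", "stem"), ("들", "deriv"), ("이", "case")],
    [("밥", "stem"), ("을", "case")],
    [("먹", "stem"), ("었", "ending"), ("다", "ending"), (".", "symbol")],
]

if __name__ == "__main__":
    for level in range(1, 6):
        print(level, segment_sentence(SENTENCE, level))
```

In this sketch the sentence has three tokens at level 1 and nine at level 5. Levels 4 and 5 differ only in whether the plural suffix 들 stays attached to its noun, which loosely mirrors the contrast the abstract draws between separating only functional morphemes and separating all morphemes.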

Keywords: word segmentation granularity, morphological segmentation, agglutinative language, evaluation

Article outline

1. Introduction
2. Previous work
3. Definition of segmentation granularity
- 3.1 Level 1: Eojeols
- 3.2 Level 2: Separating words and symbols
- 3.3 Level 3: Separating case markers
- 3.4 Level 4: Separating verbal endings
- 3.5 Level 5: Separating all morphemes
- 3.6 Discussion
4. Diagnostic analysis
- 4.1 Language processing tasks
  - Word segmentation, morphological analysis and POS tagging
  - Syntactic parsing
  - Machine translation
- 4.2 Results and discussion
Conclusion
Acknowledgement
Notes
References

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

Published online: 30 May 2024

