Word segmentation granularity in Korean
This paper describes word segmentation granularity in Korean language processing. Between the eojeol, a word delimited by
blank spaces, and a full sequence of morphemes, there are multiple possible levels of word segmentation granularity in Korean.
Several different granularity levels have been proposed and used for specific language processing and corpus annotation tasks,
because agglutinative languages, including Korean, have a one-to-one mapping between functional morphemes and syntactic
categories. We therefore analyze these granularity levels and present examples of Korean language processing systems that use
them, for future reference. Interestingly, the granularity that separates only functional morphemes, including case markers and
verbal endings, while keeping other suffixes for morphological derivation attached, yields the best performance for phrase
structure parsing. This contradicts previous best practice for Korean language processing, in which separating all morphemes has
been the de facto standard for various applications.
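The five granularity levels named in the outline below can be illustrated with a minimal Python sketch. The example sentence, its morpheme analysis, and the coarse category labels (`stem`, `deriv`, `case`, `ending`, `symbol`) are hand-written assumptions for illustration, not the output of any tagger or the paper's own annotation scheme. In particular, treating each verbal ending morpheme as its own token at Level 4 is a simplification made here.

```python
# Hand-written analysis of "학생들이 밥을 먹었다." (an illustrative assumption,
# not tagger output). Each eojeol is a list of (morpheme, category) pairs.
SENTENCE = [
    [("학생", "stem"), ("들", "deriv"), ("이", "case")],
    [("밥", "stem"), ("을", "case")],
    [("먹", "stem"), ("었", "ending"), ("다", "ending"), (".", "symbol")],
]

# Categories that become separate tokens at each level (hypothetical labels).
SPLIT_AT = {
    1: set(),                                            # eojeols only
    2: {"symbol"},                                       # + symbols
    3: {"symbol", "case"},                               # + case markers
    4: {"symbol", "case", "ending"},                     # + verbal endings
    5: {"symbol", "case", "ending", "deriv", "stem"},    # all morphemes
}


def segment(sentence, level):
    """Return the token list for a sentence at the given granularity level."""
    split = SPLIT_AT[level]
    tokens = []
    for eojeol in sentence:
        current = ""  # morphemes that stay attached at this level
        for morph, category in eojeol:
            if category in split:
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(morph)
            else:
                current += morph
        if current:
            tokens.append(current)
    return tokens


if __name__ == "__main__":
    for level in range(1, 6):
        print(level, " ".join(segment(SENTENCE, level)))
```

In this sketch, Level 4 keeps the derivational suffix 들 attached to its stem while splitting off case markers and verbal endings, which corresponds to the granularity the abstract reports as best for phrase structure parsing. The example deliberately uses morphemes that concatenate without contraction, so joining them reproduces the surface form.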
Article outline
- 1. Introduction
- 2. Previous work
- 3. Definition of segmentation granularity
- 3.1 Level 1: Eojeols
- 3.2 Level 2: Separating words and symbols
- 3.3 Level 3: Separating case markers
- 3.4 Level 4: Separating verbal endings
- 3.5 Level 5: Separating all morphemes
- 3.6 Discussion
- 4. Diagnostic analysis
- 4.1 Language processing tasks
- Word segmentation, morphological analysis and POS tagging
- Syntactic parsing
- Machine translation
- 4.2 Results and discussion
- Conclusion
- Acknowledgement
- Notes
- References