What are the “phonemes” in phoneme-grapheme mappings?: A perspective on the use of databases for lexicon development

Cahill, Lynne

doi:10.1075/wll.20.1.06cah

Article published in:

Orthographic Databases and Lexicons
Edited by Lynne Cahill and Terry Joyce
[Written Language & Literacy 20:1] 2017
pp. 104–127

Lynne Cahill | University of Sussex

The CELEX lexical database (Baayen, Piepenbrock & van Rijn 1995) was developed in the 1990s and provides the syntactic, morphological, phonological and orthographic forms of between 50,000 and 125,000 words of Dutch, English and German. It served as the basis for the PolyLex lexicons, which included syntactic, morphological and phonological information for around 3,000 words of Dutch, English and German. Orthographic information was subsequently added in the PolyOrth project. PolyOrth assumed that the underlying, lexical phonological forms could be used to derive the surface orthographic forms by combining phoneme-grapheme mappings with a set of autonomous spelling rules for each language. One complication encountered during the project was that the phonological forms in CELEX were not always genuinely underlying forms, which made the orthographic forms difficult to derive. This paper discusses the nature and status of underlying phonological forms, their relation to orthography, and the problems of finding this information in databases.

Keywords: phoneme-grapheme mappings, lexical databases, lexicons, underlying phonology, lexical phonology, post-lexical phonology

Article outline

1. Introduction
2. Background
- 2.1 Phoneme-grapheme mappings
- 2.2 Databases
- 2.3 Lexicons
3. CELEX
- 3.1 Lemmas and word forms
- 3.2 Orthography databases
- 3.3 Phonology databases
4. PolyLex and PolyOrth
- 4.1 PolyLex
  - Hierarchy construction
  - Automatic extension
- 4.2 PolyOrth – orthography from phonology
  - 4.2.1 Phoneme-grapheme mappings
5. Problems
6. Discussion
7. Conclusions
Notes
References

Published online: 19 October 2017

References

Baayen, Harald, Richard Piepenbrock & Hedderik van Rijn. (1995). The CELEX lexical database, Release 2 (CD-ROM). Philadelphia, PA: Linguistic Data Consortium, University of Pennsylvania.

Benesty, Jacob, M. M. Sondhi & Yiteng Huang. (Eds.) (2008). Springer Handbook of Speech Processing. Berlin: Springer.

Brown, Dunstan & Andrew Hippisley. (2012). Network Morphology: A Defaults-based Theory of Word Structure. Cambridge: CUP.

Burnage, Gavin. (1996). CELEX: A Guide for Users. The CELEX lexical database, Release 2 (CD-ROM). Philadelphia, PA: Linguistic Data Consortium, University of Pennsylvania.

Cahill, Lynne. (2001). Semi-automatic construction of multilingual lexicons. Machine Translation Review. Electronic journal available at [URL].

Cahill, Lynne. (1990). Syllable-based Morphology. Proceedings of the 13th International Conference on Computational Linguistics (COLING90), Vol. 3, Helsinki, Finland, August 1990, 48–53.

Cahill, Lynne & Gerald Gazdar. (1999). The PolyLex Architecture: Multilingual Lexicons for Related Languages. Traitement Automatique des Langues 40.2: 5–23.

Cahill, Lynne, Carole Tiberius & Jon Herring. (2013). PolyOrth: Orthography, phonology and morphology in inheritance lexicons. Written Language and Literacy 16.2: 146–185.

Carney, Edward. (1994). A survey of English Spelling. London: Arnold.

Carroll, J. & C. Grover. (1989). The derivation of a large computational lexicon of English from LDOCE. In B. Boguraev & E. Briscoe (eds.) Computational Lexicography for Natural Language Processing, 117–134. Harlow, UK: Longman.

Evans, R., P. Piwek, L. Cahill & N. Tipper. (2008). Natural Language Processing in CLIME – a multilingual legal advisory system. Journal of Natural Language Engineering 14.1: 101–132.

Evertz, Martin & Beatrice Primus. (2013). The graphematic foot in English and German. Writing Systems Research 5.1: 1–23.

Finkel, Raphael & Gregory Stump. (2007). Principal Parts and Morphological Typology. Morphology 17.1: 39–75.

Goldrick, Matthew & Brenda Rapp. (2007). Lexical and post-lexical phonological representations in spoken production. Cognition 102: 219–260.

Herring, Jon. (2006). Orthography and the lexicon. PhD Dissertation, University of Brighton.

Nerbonne, John. (1998). Linguistic Databases. CSLI (ISBN: 9781575860930)

New, Boris, Christophe Pallier, Marc Brysbaert & Ludovic Ferrand. (2004). Lexique 2: A New French Lexical Database. Behavior Research Methods, Instruments, & Computers 36: 516.

Nunn, Anneke. (1998). Dutch Orthography: A systematic investigation of the spelling of Dutch words. The Hague: Holland Academic Graphics.

Rollings, Andrew G. (2004). The spelling patterns of English. Munich: Lincom.

Sampson, Geoffrey. (2015). Writing Systems: A Linguistic Introduction. (2nd Edn.). Stanford: Stanford University Press.

Sproat, Richard. (2012). The Consistency of the Orthographically Relevant Level in Dutch. In Martin Neef, Anneke Neijt & Richard Sproat (eds.) The Relation of Writing to Spoken Language, 35–46. Berlin, Boston: Max Niemeyer Verlag.

Sproat, Richard. (2000). A computational theory of writing systems. Cambridge: CUP.

Swadesh, Morris. (1934). The Phonemic Principle. Language 10.2: 117–129.

Wells, John C. (1987). Computer coded phonetic transcription. Journal of the International Phonetic Association 17.2: 94–114.