SHOTGUN: converting words into triplets: A hybrid approach to grapheme-phoneme conversion in Dutch

Beeksma, Merijn; Neijt, Anneke; Zuidema, Johan

doi:10.1075/wll.19.2.02bee

Article published in:

Written Language & Literacy
Vol. 19:2 (2016), pp. 157–188

Merijn Beeksma | Radboud University Nijmegen

Anneke Neijt

Johan Zuidema

Software systems convert between graphemes and phonemes using lexicon-based, rule-based or data-driven techniques. SHOTGUN combines these techniques in a hybrid system that converts between graphemes and phonemes bi-directionally, adds linguistic and educational information about the relationships between graphemes and phonemes, and estimates the likelihood that the generated output is correct. We describe the components from which SHOTGUN is built and determine its accuracy by running tests on two data sources, the BasisSpellingBank and CELEX, comparing the results with those of Nunn’s (1998) rule-based conversion system. SHOTGUN converts phonemes to graphemes and vice versa with a precision of 81% and 86%, respectively, when tested on the BasisSpellingBank, and 80% and 81% when tested on CELEX. SHOTGUN proves to be a powerful new conversion tool.

Keywords: automatic bi-directional grapheme-phoneme conversion, Dutch language, grapheme-phoneme relationship, overlap algorithm, triplet analysis

Published online: 1 June 2017

References

Busser, Bertjan, Walter Daelemans & Antal van den Bosch

(1999) Machine learning of word pronunciation: the case against abstraction. In Eurospeech 99: 2123–2126.

Cranshoff, Betty & Johan Zuidema

(2010) Van Dale Basisspellinggids. Utrecht/Antwerpen: Van Dale Lexicografie.

Daelemans, Walter

(1988) GRAFON: a grapheme-to-phoneme conversion system for Dutch. Proceedings, 12th international conference on computational linguistics (COLING-88), vol. 1: 133–138.

Daelemans, Walter & Antal van den Bosch

(1993) Data-oriented methods for grapheme-to-phoneme conversion. Proceedings of EACL 6: 45–53.

(1997) Language-independent data-oriented grapheme-to-phoneme conversion. In Jan P.H. van Santen, Richard W. Sproat, Joseph P. Olive & Julia Hirschberg (eds.), Progress in speech synthesis, Section 2: 77–89. New York: Springer-Verlag.

(2001) TreeTalk: memory-based word phonemisation. In Robert I. Damper (ed.), Data-driven techniques in speech synthesis, Chapter 7. Cambridge: MIT Press.

Daelemans, Walter, Antal van den Bosch & Ton Weijters

(1996) IGTree: Using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review 11: 407–423.

Daelemans, Walter & Helmer Strik

(2002) Het Nederlands in taal- en spraaktechnologie: prioriteiten voor basisvoorzieningen. Een rapport in opdracht van de Nederlandse Taalunie. Second version.

Damper, Robert & John Eastmond

(1997) Pronunciation by analogy: Impact of implementational choices on performance. Language and Speech 40(1): 1–23.

Decadt, Bart, Jacques Duchateau, Walter Daelemans & Patrick Wambacq

(2002) Memory-based phoneme-to-grapheme conversion. Language and Computers 45(1): 47–61.

Galescu, Lucian & James F. Allen

(2001) Bi-directional conversion between graphemes and phonemes using a joint n-gram model. Proceedings of the 4th ISCA Tutorial and Research Workshop on Speech Synthesis 4: 103–108.

Geeraerts, Dirk

(2002) Groot woordenboek van de Nederlandse taal (CD-ROM, version 1.0 Plus). Utrecht/Antwerpen: Van Dale Lexicografie.

Hamming, Richard

(1950) Error detecting and error correcting codes. Bell System Technical Journal 29(2): 147–160.

Heemskerk, Josée & Wim Zonneveld

(2000) Uitspraakwoordenboek. Utrecht: Het Spectrum.

Hoste, Veronique, Steven Gillis & Walter Daelemans

(2000) Machine learning for modeling Dutch pronunciation variation. 10th Meeting on Computational Linguistics in the Netherlands: 73–84. Utrecht Institute of Linguistics OTS.

Jones, Daniel

(1996) Analogical natural language processing. London: UCL Press.

Jongenburger, Willy & Vincent J. van Heuven

(1993) Sandhi processes in natural and synthetic speech. In Vincent J. van Heuven & Louis C.W. Pols (eds.), Analysis and synthesis of speech, strategy research towards high quality text-to-speech-generation: 261–276. Berlin: Mouton de Gruyter.

Levenshtein, Vladimir

(1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8): 707–710.

Marchand, Yannick & Robert Damper

(2000) A multistrategy approach to improving pronunciation by analogy. Computational Linguistics 26(2): 195–219.

Nunn, Anneke M.

(1998) Dutch orthography. A systematic investigation of the spelling of Dutch words (Doctoral dissertation). The Hague: Holland Academic Graphics.

Nunn, Anneke M. & Vincent J. van Heuven

(1993) MORPHON: lexicon-based text-to-phoneme conversion and phonological rules. In Vincent J. van Heuven & Louis C.W. Pols (eds.), Analysis and synthesis of speech, strategy research towards high quality text-to-speech-generation: 77–99. Berlin: Mouton de Gruyter.

Santen, Jan P.H. van, Richard W. Sproat, Joseph P. Olive & Julia Hirschberg

(1997) Progress in speech synthesis, Section 2. New York: Springer-Verlag.

Skousen, Royal

(1989) Analogical modeling of language. Dordrecht: Kluwer Academic Publishers.

Zuidema, Johan & Anneke Neijt

(2012) Verkennend onderzoek naar de wenselijkheid en de haalbaarheid van een verrijking van de Woordenlijst Nederlandse Taal ten behoeve van spellingonderwijs. Nijmegen: Radboud Universiteit Nijmegen. Available online: [URL].

(to appear) The BasisSpellingBank – spelling knowledge stored in a lexicon of triplets.

Cited by

Cited by 1 other publication

Zuidema, Johan & Anneke Neijt

2017. The BasisSpellingBank. Written Language & Literacy 20:1, pp. 52 ff.
