Article published in: Computational terminology and filtering of terminological information
Edited by Patrick Drouin, Natalia Grabar, Thierry Hamon, Kyo Kageura and Koichi Takeuchi
[Terminology 24:1] 2018
pp. 7–22
Word embedding dataset from ‘NINJAL Web Japanese Corpus’
In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus (NWJC)’. NWJC is a Web-crawled text corpus containing 25.8 billion tokens. We construct two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles (Bunrui Goihyo)’.
Keywords: word embedding, web corpus, thesaurus, Japanese language
Published online: 31 May 2018
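The abstract mentions evaluating the embeddings against a thesaurus but does not give the procedure. As a rough illustration only, the sketch below shows one way such a comparison could be set up: for each word, find its nearest neighbour by cosine similarity and check whether the two share a thesaurus category. The function names, the toy vectors, and the category labels are all hypothetical; they are not taken from NWJC2Vec or from Bunrui Goihyo, and the measure is not necessarily the one used in the article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbour(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


def category_agreement(vectors, categories):
    """Fraction of words whose nearest neighbour shares their category.

    Only words present in both `vectors` and `categories` are considered.
    """
    shared = {w: vec for w, vec in vectors.items() if w in categories}
    hits = sum(
        categories[nearest_neighbour(w, shared)] == categories[w]
        for w in shared
    )
    return hits / len(shared)


# Toy data (assumption): hand-made 3-dimensional vectors and made-up
# category labels, standing in for real embeddings and thesaurus codes.
TOY_VECTORS = {
    "inu": [0.9, 0.1, 0.0],     # "dog"
    "neko": [0.8, 0.2, 0.1],    # "cat"
    "hashiru": [0.1, 0.9, 0.2], # "run"
    "aruku": [0.0, 0.8, 0.3],   # "walk"
}
TOY_CATEGORIES = {
    "inu": "animal",
    "neko": "animal",
    "hashiru": "motion",
    "aruku": "motion",
}

if __name__ == "__main__":
    print(nearest_neighbour("inu", TOY_VECTORS))
    print(category_agreement(TOY_VECTORS, TOY_CATEGORIES))
```

With real data, the toy dictionaries would be replaced by vectors loaded from the embedding files and by a word-to-category mapping derived from the thesaurus. The brute-force neighbour search here is only practical for small vocabularies.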