Word embedding dataset from ‘NINJAL Web Japanese Corpus’
Masayuki Asahara | National Institute for Japanese Language and Linguistics
In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus (NWJC)’. NWJC is a Web-crawled text corpus containing 25.8 billion tokens. We construct two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles (Bunrui Goihyo)’.
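As a rough illustration of what comparing an embedding dataset with a thesaurus can look like, the sketch below checks whether words sharing a thesaurus category are closer in vector space than words from different categories. This is not the evaluation procedure of the paper, which the abstract does not specify. The function names, the toy vectors, and the category labels are all hypothetical; a real run would load NWJC2Vec vectors and Bunrui Goihyo category assignments instead. The dictionary keys may be surface forms or morpheme-level entries, depending on which version of the dataset is used.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pair_similarity(vectors, categories):
    """Return (within, across): mean cosine similarity over word pairs that
    share a thesaurus category and over pairs that do not.

    Words missing from `categories` are skipped.
    """
    within, across = [], []
    for w1, w2 in combinations(sorted(vectors), 2):
        if w1 not in categories or w2 not in categories:
            continue
        sim = cosine(vectors[w1], vectors[w2])
        if categories[w1] == categories[w2]:
            within.append(sim)
        else:
            across.append(sim)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(across)


# Toy data (assumption): 3-dimensional vectors and made-up category labels.
toy_vectors = {
    "犬": [0.9, 0.1, 0.0],
    "猫": [0.8, 0.2, 0.1],
    "走る": [0.1, 0.9, 0.2],
    "歩く": [0.0, 0.8, 0.3],
}
toy_categories = {"犬": "animal", "猫": "animal", "走る": "motion", "歩く": "motion"}

within, across = mean_pair_similarity(toy_vectors, toy_categories)
print(f"within-category: {within:.3f}, across-category: {across:.3f}")
```

A larger gap between the within-category and across-category means suggests that the embedding space agrees more closely with the thesaurus grouping. The same check can be run on both versions of the dataset to compare them.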