Named entity recognition in Bengali using system combination
Asif Ekbal | Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
Sivaji Bandyopadhyay | Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
This paper reports a voted Named Entity Recognition (NER) system that exploits appropriate unlabeled data. We first develop NER systems using three supervised machine learning algorithms: Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM). Each model uses language-independent features, in the form of contextual and orthographic word-level features, together with language-dependent features extracted from a Part-of-Speech (POS) tagger and gazetteers. Context patterns generated from the unlabeled data by an active learning method are also used as features in each classifier. We propose a semi-supervised method that defines measures for automatically selecting effective documents, and effective sentences within them, from the unlabeled data. Finally, the supervised models are combined into a single system using an appropriate weighted voting technique. Experimental results for Bengali, a resource-poor language, show the effectiveness of the proposed approach, with overall recall, precision and F-measure values of 93.81%, 92.18% and 92.98%, respectively.
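The weighted voting step can be illustrated with a minimal sketch. The abstract does not state how the weights are defined, so the code below assumes one fixed weight per classifier (for example, its F-measure on held-out data); the function name `weighted_vote`, the tag labels and the weight values are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-token tag sequences from several classifiers.

    predictions: dict mapping classifier name -> list of tags (one per token).
    weights:     dict mapping classifier name -> non-negative weight
                 (assumed here to be a single score per classifier).
    Returns the list of tags with the highest summed weight per token.
    """
    names = list(predictions)
    n_tokens = len(predictions[names[0]])
    if any(len(predictions[name]) != n_tokens for name in names):
        raise ValueError("all classifiers must tag the same number of tokens")

    voted = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for name in names:
            # Each classifier adds its weight to the tag it proposes.
            scores[predictions[name][i]] += weights[name]
        # On a tie, max() keeps the tag that was proposed first.
        voted.append(max(scores, key=scores.get))
    return voted


if __name__ == "__main__":
    # Hypothetical outputs of the three models for a three-token sentence.
    preds = {
        "ME": ["B-PER", "O", "B-LOC"],
        "CRF": ["B-PER", "I-PER", "O"],
        "SVM": ["O", "I-PER", "B-LOC"],
    }
    # Hypothetical weights, e.g. held-out F-measures.
    w = {"ME": 0.85, "CRF": 0.90, "SVM": 0.91}
    print(weighted_vote(preds, w))
```

With these assumed inputs, each token receives the tag backed by the largest total weight, so two agreeing classifiers outvote the third unless the third carries more weight than both combined. Finer-grained schemes, such as a separate weight per classifier and tag, would fit the same structure by changing how `weights` is looked up.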