Bhashini-AI4Bharat Textual Language Detection v1.0

Detects the language of the provided text. Currently supports 23 languages (English, Bangla, Manipuri, Bodo, Konkani, Oriya, Nepali, Marathi, Sindhi, Sanskrit, Malayalam, Urdu, Assamese, Telugu, Dogri, Gujarati, Kashmiri, Punjabi, Santali, Maithili, Hindi, Tamil, Kannada).

  • Digital India BHASHINI Division
  • BHASHINI_shailendra

About Model

IndicLID is a language identifier for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text. It is the first LID for romanized text in Indian languages. It is a two-stage classifier: an ensemble of a fast linear classifier and a slower classifier fine-tuned from a pre-trained language model. It can predict 47 classes (24 native-script classes and 21 roman-script classes, plus English and Others). IndicLID is evaluated on the Bhasha-Abhijnaanam benchmark, which is released along with this work. For native-script text, IndicLID has better language coverage than existing LIDs and is competitive with or better than them. The IndicLID model is 10 times faster and 4 times smaller than the NLLB model, and it also establishes a strong baseline on roman-script text.
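
The two-stage design described above can be sketched roughly as follows. This is a minimal illustration, not IndicLID's actual code: the model card does not say how the two classifiers are combined, so the confidence-threshold fallback, the threshold value, the function names, the toy classifiers and the label strings are all assumptions made for the example.

```python
from typing import Callable, Tuple

# A classifier maps text to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def two_stage_identify(
    text: str,
    fast: Classifier,
    slow: Classifier,
    threshold: float = 0.6,  # hypothetical cut-off, not taken from the model card
) -> Tuple[str, str]:
    """Return (label, stage_used) for `text`."""
    label, confidence = fast(text)
    if confidence >= threshold:
        # The cheap linear model is confident enough; stop here.
        return label, "fast"
    # Low confidence: defer to the slower, fine-tuned classifier.
    slow_label, _ = slow(text)
    return slow_label, "slow"


# Toy stand-ins for the two models, for illustration only.
def toy_fast(text: str) -> Tuple[str, float]:
    # Confident only when the text contains Devanagari characters.
    if any("\u0900" <= ch <= "\u097f" for ch in text):
        return "hin_Deva", 0.95  # hypothetical label string
    return "other", 0.30


def toy_slow(text: str) -> Tuple[str, float]:
    # Pretend the slower model recognises one romanized Hindi word.
    if "namaste" in text.lower():
        return "hin_Latn", 0.80  # hypothetical label string
    return "eng_Latn", 0.70


if __name__ == "__main__":
    for sample in ["नमस्ते", "Namaste dost", "hello"]:
        print(sample, "->", two_stage_identify(sample, toy_fast, toy_slow))
```

Under this assumed routing, most inputs would be handled by the fast classifier alone, and only ambiguous inputs (typically romanized text) would pay the cost of the slower model.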

Metadata

MIT

AI4Bharat

OCR (Optical Character Recognition) Model

Open

Digital India BHASHINI Division

Sector Agnostic

05/03/25 15:21:43

Admin

3 MB

Activity Overview

  • Downloads 72
  • Views 614
  • File Size 3 MB

Tags

  • Multilingual
  • AI4Bharat
  • NLP
  • Bhashini
  • Text Processing
  • Deep Learning
  • Transformer
  • Text Language Detection
  • Text data

License Control

MIT

Version Control

Version 2 (3 MB)
  • admin · 1 month(s) ago
    • IndicLID-master.zip