Government Of India
Indian Multilingual Speech Dataset - Project Vaani

A large-scale multilingual speech dataset covering 54 languages across 80 districts of India, aimed at developing AI-driven speech recognition (ASR), speech translation (SST), and natural language understanding (NLU) models.

About Dataset

VAANI is an India-representative, multi-modal, multilingual dataset. The current version (phase 1, 80 districts) contains ~16,000 hours of spontaneous, image-prompted speech (9.6 million utterances) by 84.6K speakers across 80 districts, talking about 130K images and covering 54 languages. Of this audio, 788.03 hours are transcribed (text), spread almost evenly across the 80 districts. Project Vaani, by IISc, Bangalore and ARTPARK, is capturing the true diversity of… See the full description on the dataset page: https://huggingface.co/datasets/ARTPARK-IISc/Vaani.
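
To give a feel for the scale these figures imply, the short sketch below derives a few averages from the numbers quoted in the description. It is only back-of-the-envelope arithmetic: the inputs are the rounded totals stated above, the constant and function names are hypothetical, and the real per-utterance, per-speaker, and per-district distributions may differ considerably from these means.

```python
# Rounded totals quoted in the dataset description (all approximate).
TOTAL_HOURS = 16_000          # ~16,000 hours of speech
UTTERANCES = 9_600_000        # 9.6 million utterances
SPEAKERS = 84_600             # 84.6K speakers
DISTRICTS = 80                # phase 1 districts
TRANSCRIBED_HOURS = 788.03    # hours with transcriptions


def derived_stats():
    """Return simple averages implied by the quoted totals (hypothetical helper)."""
    return {
        # Mean utterance length in seconds.
        "avg_utterance_seconds": TOTAL_HOURS * 3600 / UTTERANCES,
        # Mean recorded audio per speaker, in minutes.
        "avg_minutes_per_speaker": TOTAL_HOURS * 60 / SPEAKERS,
        # Mean transcribed audio per district, in hours.
        "avg_transcribed_hours_per_district": TRANSCRIBED_HOURS / DISTRICTS,
        # Share of the audio that is transcribed, in percent.
        "transcribed_share_percent": 100 * TRANSCRIBED_HOURS / TOTAL_HOURS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.2f}")
```

On the quoted totals this works out to about 6 seconds per utterance and just under 10 transcribed hours per district, with roughly 5% of the audio transcribed.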

Activity Overview

  • Downloads 27
  • Views 735
  • File Size 0

Tags

  • ASR
  • Indian Languages
  • NLP
  • Machine Learning
  • Natural Language Processing
  • Multilingual AI
  • Speech Dataset
  • Speech Recognition
  • speech transcription
  • Dataset
  • AI for India
  • Project Vaani

License Control

CC-by-4.0

Version Control

No Record(s) Found

No Version(s) Found