Indian Multilingual Speech Dataset - Project Vaani

A large-scale multilingual speech dataset covering 54 languages across 80 districts of India, intended for developing AI-driven speech recognition (ASR), speech translation (SST), and natural language understanding (NLU) models.

About Dataset

VAANI is an India-representative, multi-modal, multilingual dataset. The current version (Phase 1, 80 districts) contains ~16,000 hours of spontaneous, image-prompted speech (9.6 million utterances) by 84.6K speakers across 80 districts, talking about 130K images and covering 54 languages. Of this audio, 788.03 hours have been transcribed to text, spread almost evenly across the 80 districts. Project Vaani, by IISc, Bangalore and ARTPARK, is capturing the true diversity of… See the full description on the dataset page: https://huggingface.co/datasets/ARTPARK-IISc/Vaani.
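To put the headline figures in perspective, the short Python sketch below derives a few rough averages from them. It uses only the numbers quoted in the description. Because the total of ~16,000 hours is itself approximate, the results are back-of-the-envelope estimates rather than official statistics, and the function and key names are illustrative.

```python
# Figures quoted in the description above; "~16,000" is approximate,
# so every derived value is a rough estimate, not an official statistic.
TOTAL_HOURS = 16_000
TRANSCRIBED_HOURS = 788.03
UTTERANCES = 9_600_000
SPEAKERS = 84_600
DISTRICTS = 80


def derived_stats() -> dict[str, float]:
    """Return back-of-the-envelope averages from the headline figures."""
    return {
        "avg_utterance_seconds": TOTAL_HOURS * 3600 / UTTERANCES,
        "avg_minutes_per_speaker": TOTAL_HOURS * 60 / SPEAKERS,
        "avg_hours_per_district": TOTAL_HOURS / DISTRICTS,
        "transcribed_hours_per_district": TRANSCRIBED_HOURS / DISTRICTS,
        "transcribed_fraction": TRANSCRIBED_HOURS / TOTAL_HOURS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.3f}")
```

On these figures an average utterance lasts about 6 seconds, each district contributes roughly 200 hours of audio, and just under 5% of the audio is transcribed.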

Dataset Metadata

License

CC-by-4.0

Geographical coverage

India

Sector

Sector Agnostic

Author

ARTPARK-IISc

Source organisation

I-Hub For Robotics and Autonomous Systems Innovation Foundation

Uploaded by

N.A.

AI Ready

Dataset type

Unstructured

Frequency

Time Granularity

Static

Year range

N.A.

Date & Time

18/02/25 12:07:24

Visibility

Open
