OpenHathi-Hi-v0.1: Bridging the Language Divide with an Innovative Hindi Language Model

Sarvam AI, an Indian startup, has unveiled OpenHathi-Hi-v0.1, a large language model designed specifically for Hindi-speaking users. The model is built on the seven-billion-parameter version of Meta's Llama 2 and, according to Sarvam, surpasses OpenAI's GPT-3.5 Turbo on various Hindi tasks. It aims to democratize access to large language models for the vast population in countries like India, Pakistan, Sri Lanka, and Bangladesh.

Key Points:

Multilingual Mastery: OpenHathi-Hi-v0.1 is proficient in Hindi, English, and Hinglish. Sarvam extended Llama 2's tokenizer to a 48K-token vocabulary, giving the model room for additional languages and specialized vocabulary.

Addressing Language Gaps: While open models like Llama and Mistral have democratized access to large language models, Sarvam highlights their limited support for Indic languages spoken by over 800 million people. OpenHathi-Hi-v0.1 aims to fill this gap and encourage innovation in Indian language AI.

Training Challenges: Training an AI model for Hindi posed unique challenges. Sarvam had to build a dedicated tokenizer, include Romanized Hindi in the training data, and alternate Hindi and English sentences during training to improve the model's proficiency in both languages.

Collaborative Efforts: To overcome the lack of training data, Sarvam collaborated with AI4Bharat, a research lab at the Indian Institute of Technology Madras, to translate English content into Hindi. This collaboration provided essential language resources and benchmarks for building and testing the model.

Performance Metrics: According to Sarvam, OpenHathi-Hi-v0.1 outperformed GPT-3.5 and GPT-4 on the FLORES-200 benchmark when translating Devanagari Hindi into English. Although the model is competitive, Sarvam acknowledges its limitations, such as susceptibility to catastrophic forgetting, in which a model loses earlier capabilities as it is trained on new data.

Accessibility: The model is available for download via Hugging Face, with Sarvam emphasizing its potential for fine-tuning on specific tasks. Enterprise-grade versions are set to launch soon, catering to diverse user needs.

Future Prospects: Sarvam envisions OpenHathi as a catalyst for innovation in Indian language AI and invites developers to build fine-tuned models on top of it. The startup has shared a detailed walkthrough of its evaluation process on its YouTube channel.
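
Two of the key points above describe mechanisms concrete enough to sketch: extending a tokenizer's vocabulary, and building training text that alternates Hindi and English sentences. The Python below is a minimal illustration under stated assumptions, not Sarvam's actual pipeline, which the announcement does not describe in detail. The function names `extend_vocab` and `alternate_languages` are hypothetical, the toy vocabulary stands in for Llama 2's real 32,000-entry one, and "48K" is assumed to mean exactly 48,000.

```python
from typing import Dict, Iterable, List, Tuple

# Llama 2's original tokenizer has a 32,000-entry vocabulary.
LLAMA2_VOCAB_SIZE = 32_000
# The article says "48K"; the exact figure 48,000 is an assumption.
TARGET_VOCAB_SIZE = 48_000


def extend_vocab(base_vocab: Dict[str, int], candidates: Iterable[str]) -> Dict[str, int]:
    """Return a copy of base_vocab with unseen candidate tokens appended.

    Existing token IDs are left untouched so that embeddings learned by
    the base model stay aligned; each new token gets the next free ID.
    """
    vocab = dict(base_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for token in candidates:
        if token not in vocab:
            vocab[token] = next_id
            next_id += 1
    return vocab


def alternate_languages(pairs: List[Tuple[str, str]], start_with_hindi: bool = True) -> str:
    """Build one passage from (hindi, english) sentence pairs, switching
    language on every sentence so the two languages are interleaved."""
    use_hindi = start_with_hindi
    chosen = []
    for hindi, english in pairs:
        chosen.append(hindi if use_hindi else english)
        use_hindi = not use_hindi
    return " ".join(chosen)


if __name__ == "__main__":
    # Toy base vocabulary (hypothetical); "the" is already present and is skipped.
    base = {"<s>": 0, "</s>": 1, "the": 2}
    print(extend_vocab(base, ["नमस्ते", "the", "hathi"]))
    print("new tokens needed:", TARGET_VOCAB_SIZE - LLAMA2_VOCAB_SIZE)

    # Hypothetical parallel sentences, e.g. produced by translating English text.
    pairs = [
        ("हाथी बड़ा है।", "The elephant is big."),
        ("वह पानी पीता है।", "It drinks water."),
        ("बच्चे उसे देखते हैं।", "The children watch it."),
    ]
    print(alternate_languages(pairs))
```

Appending new tokens rather than renumbering existing ones is the usual reason a base model can be extended instead of retrained from scratch: only the embeddings for the added entries start untrained. The interleaved passage exposes the model to the same content in both languages within one context, which is one plausible reading of the alternating-sentence approach described above.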

In essence, OpenHathi-Hi-v0.1 offers a capable open tool for bridging language divides and advancing Indian language AI. With enterprise-grade versions planned, the model marks a meaningful step toward more inclusive language technology.