In a major boost to natural language processing capabilities for Indian languages, the IIT Bombay-led BharatGPT group has collaborated with Seetha Mahalaxmi Healthcare (SML) to introduce ‘Hanooman’ – a suite of Indic language models that currently respond in 11 Indian languages. With plans to expand support to over 20 languages, these AI models aim to enable text, speech, video and multimedia generation for diverse applications spanning the healthcare, governance, finance and education sectors.
Key Details
- Series of large language models (LLMs) responding in 11 Indian languages presently
- Multimodal AI tools generating text, speech, video in Indian languages
- Size ranging from 1.5 billion to 40 billion parameters
- Working across 4 key areas – healthcare, governance, finance, education
BharatGPT Ecosystem
- Research consortium of 8 IITs led by IIT Bombay
- Backed by the Department of Science & Technology, SML and Reliance Jio
- Aims to develop India-specific LLMs similar to ChatGPT
Significance of Indigenous Models
- Mitigate concerns around data privacy and relevance
- Address India's language diversity
- Enhance access to and proliferation of AI
LLM Architecture, Training and Applications
Architecture
The Hanooman models are based on the transformer architecture, composed of:
- Embeddings to numerically represent text
- Encoders to establish contextual relationships
- Decoders to generate target text
Self-attention mechanism
The self-attention mechanism detects correlations across the input data, helping the models capture longer-range dependencies in text.
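For reference, the standard scaled dot-product attention used in transformer models (from the original "Attention Is All You Need" paper) is shown below; the source does not specify Hanooman's exact variant, so this is the textbook formulation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ are query, key and value projections of the input sequence and $d_k$ is the key dimension.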
Training
- Models trained on large Indian language corpora
- Corpus contains text from diverse sources
- Helps the models understand complex concepts and relationships in text
Applications
- Natural Language Understanding
- Machine Translation
- Question Answering
- Text Summarization
- Sentiment Analysis
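As an illustration of these task categories, the snippet below uses the generic Hugging Face `pipeline` API with its default English models; it is a minimal sketch of the task types only, not of the Hanooman models, which are not publicly exposed through this API in the source.

```python
# Minimal sketch of common LLM application tasks via Hugging Face pipelines.
# Default (English) models are used as stand-ins for illustration.
from transformers import pipeline

# Sentiment analysis
sentiment = pipeline("sentiment-analysis")
print(sentiment("The new Indic language models are impressive."))

# Text summarization
summarizer = pipeline("summarization")
article = ("Hanooman is a suite of Indic language models developed by the "
           "BharatGPT group for text, speech and multimedia generation "
           "across healthcare, governance, finance and education.")
print(summarizer(article, max_length=30, min_length=10))

# Question answering over a given context
qa = pipeline("question-answering")
print(qa(question="Who developed Hanooman?",
         context="Hanooman was developed by the IIT Bombay-led BharatGPT group."))
```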
Specialized Models – VizzhyGPT and LegalGPT
Apart from the base Hanooman models, customized models have also been developed for specific domains:
VizzhyGPT
- Fine-tuned AI model for healthcare
- Trained on large volumes of Indian medical data
- Applications: Medical chat, lab report assessments etc.
LegalGPT
- Targets legal domain
- Trained on Indian legal data
- Applications: Review of legal contracts, case law analysis etc.
These demonstrate the potential of building task-specific LLMs given sufficient domain training data.
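As a rough illustration of how such domain models are typically built, here is a minimal fine-tuning sketch using the Hugging Face Transformers library. Note that `gpt2` and `medical_corpus.txt` are placeholders: the actual Hanooman base models and their medical/legal training corpora are not public.

```python
# Minimal sketch of domain fine-tuning with Hugging Face Transformers.
# "gpt2" and "medical_corpus.txt" are placeholders, not the actual
# Hanooman base model or corpus.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical domain corpus: one document per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

# mlm=False selects the causal (next-word) language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```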
Key Data and Figures
Parameter Size of Select Hanooman Models:
| Model | Parameter Size |
| --- | --- |
| Hanooman 1.5B | 1.5 billion |
| Hanooman 40B | 40 billion |
| Hanooman-Tamil | 2.8 billion |
Languages Covered: Hindi, Tamil, Marathi, Bengali, Telugu, Malayalam, Kannada, Gujarati, Punjabi, Odia, Urdu
Neural Network Architecture
The Hanooman models are based on a transformer neural network architecture. This consists of an encoder and a decoder:
- Encoder: Breaks the input text into smaller chunks and draws contextual relationships between words through self-attention mechanisms, capturing dependencies regardless of position in the text.
- Decoder: Uses the encoder's output to generate the target text word by word, predicting each next word from all previous words. This helps generate coherent, relevant output text.
The self-attention layer connects all positions of the input sequence and computes a representation by integrating information from the entire sequence, allowing the model to capture long-range dependencies in text.
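The following is a minimal single-head self-attention sketch in NumPy, illustrating the computation described above. The projection matrices and dimensions are arbitrary toy values, not Hanooman's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of token embeddings X.

    X: (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_k) query/key/value projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise scores between all positions
    weights = softmax(scores, axis=-1)       # each position attends to every other
    return weights @ V                       # context-integrated representations

# Toy example: 5 tokens, 16-dim embeddings, 8-dim attention head
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```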
Training Techniques
Some key training techniques used with Hanooman models:
- Transfer Learning: The models are first pre-trained on large unlabeled corpora such as Wikipedia to obtain general language understanding capabilities, which are then transferred by fine-tuning on downstream tasks. This reduces compute requirements.
- Self-Supervised Learning: Pre-training tasks are formulated to exploit unlabeled data and capture linguistic properties. For example, the masked language modeling task predicts randomly masked words from their surrounding context (see the sketch after this list).
- Multitask Learning: Different end tasks such as translation and Q&A are incorporated into a single model, enabling learned representations to be shared across tasks. This improves overall performance.
- Curriculum Learning: Model training progresses from simpler to more complex material by gradually increasing dataset difficulty over epochs.
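Below is a toy sketch of the masked language modeling objective mentioned under self-supervised learning. Real implementations operate on integer token ids using a tokenizer's mask token, but the corruption logic follows the same idea.

```python
import random

MASK = "[MASK]"  # illustrative mask token; real tokenizers use integer ids

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for a masked-language-modeling objective.

    Returns the masked input and the labels the model must predict
    (None where no prediction is required), mirroring BERT-style pretraining.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)  # hide the token from the model
            labels.append(tok)   # ...but keep it as the training target
        else:
            inputs.append(tok)
            labels.append(None)  # no loss computed at this position
    return inputs, labels

sentence = "the model predicts masked words from the surrounding context".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```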
The development of indigenous language models like Hanooman underscores India’s advancements in AI research and applications. As these models evolve further, they will empower citizens by enhancing access to information in native languages while also presenting new opportunities for innovation. Robust policy frameworks and interdisciplinary collaboration will be vital to guide the responsible and ethical development of this technology.
