Ask me anything: Inside the race to build desi GPTs

I asked in Hindi: “Mujhe Bengaluru se Mumbai ka return ticket chahiye (I need a return ticket from Bengaluru to Mumbai).”

The bot, which can understand both text and voice inputs, responded by asking for my mobile number, sent me a one-time password (OTP), and then asked for my name, travel dates, gender, and coach requirements. It almost lured me into buying a ticket.

I figured out that Ask Disha, which can answer questions in English, Hindi and Gujarati, is a next-generation bot, one that uses generative artificial intelligence (GenAI). Systems powered by GenAI can generate a range of content, from text to high-quality images and video. For now, apart from train bookings, Ask Disha can help with payments, cancellations and changing boarding stations.

It was developed by Bengaluru-based conversational AI startup CoRover and is based on a local large language model (LLM) called BharatGPT.

LLMs are AI algorithms that use huge datasets to understand and generate content. BharatGPT was trained to understand and process Indian languages and even dialects—today, it is available in more than 14 Indian languages.

Ankush Sabharwal, co-founder and chief executive officer of CoRover.

For the Greater Chennai Police, a division of the Tamil Nadu Police, CoRover has also developed a virtual assistant called ‘AI Police’, which enables citizens to report violations and provides real-time updates on the status of a first information report, in Tamil and English.

Businesses, similarly, can build multilingual virtual assistants simply by adding local content (documents, databases, etc.) and training the model on it, Ankush Sabharwal, co-founder and chief executive officer (CEO) of CoRover, told me.

In short, local language LLMs have arrived in India, and BharatGPT is just a case in point. While ChatGPT, the chatbot developed by OpenAI, and most other LLMs in the world are trained predominantly on English datasets, companies working on Indian LLMs have the unenviable task of training their systems on languages that aren’t fully digitized; most digitized databases today are in English. That’s no easy task: India is home to more than 400 languages, making it one of the most linguistically diverse countries in the world. But a bunch of startups, and an established corporation, have accepted that challenge. Read on.

Gurnani’s challenge

Nikhil Malhotra, global head of Makers Lab and chief innovation officer at Tech Mahindra.


Nikhil Malhotra, global head of Makers Lab and chief innovation officer at Tech Mahindra, an IT services exporter, just cannot forget the night of 9 June 2023.

At 11:30 pm, C.P. Gurnani, then the CEO and managing director (MD) of Tech Mahindra, called to ask: “Should we, and can we, take up the challenge?”

Makers Lab is the research and development (R&D) wing of Tech Mahindra, and Malhotra, who knew the context of the call, expressed willingness to pick up the gauntlet.

What was the task? Earlier that day, OpenAI CEO Sam Altman had sparked a controversy by doubting whether Indian entrepreneurs could develop a generative pre-trained transformer (GPT)-type LLM, leading to a social media exchange with Gurnani and Rajan Anandan, the MD of venture firm Peak XV Partners.

Altman later clarified that his remark had been taken out of context, but it had already seeded the first thoughts of building India-specific LLMs.

On 10 June, Gurnani posted on X, formerly Twitter: “Challenge accepted.”


Five months later, on 19 December, Gurnani received a birthday and retirement-day gift in the form of Project Indus, a Hindi LLM comprising 539 million parameters and 10 billion Hindi tokens. It was released as a beta for testing within the company. Parameters in GenAI models typically refer to the weights in neural networks that are adjusted during training, enabling the model to make predictions or decisions based on input data. By comparison, GPT-3, the model behind the original ChatGPT, has 175 billion parameters. Tokens, on the other hand, are numerical representations of pieces of words and sub-words that an LLM can understand.
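To make the jargon concrete, here is a purely illustrative Python sketch, a toy, not how BharatGPT, Project Indus or any production system actually works: tokens are just integer IDs for pieces of text, and the parameter count is the number of trainable weights. The vocabulary and layer sizes below are invented for the example.

```python
# Toy illustration of tokens and parameters (not any real tokenizer).

# A tiny made-up vocabulary mapping sub-word pieces to token IDs.
vocab = {"nam": 0, "aste": 1}

def tokenize(text, vocab):
    """Greedily match the longest known sub-word piece at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            i += 1  # skip characters the vocabulary doesn't cover
    return tokens

print(tokenize("namaste", vocab))  # -> [0, 1]

# Parameter count of a single dense layer: weights plus biases.
hidden_in, hidden_out = 512, 512
params = hidden_in * hidden_out + hidden_out
print(params)  # -> 262656 trainable parameters in this one layer
```

A model like Project Indus stacks many such layers, which is how the count reaches hundreds of millions.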

“In the first phase, we will be creating an LLM for Hindi language and its 40-odd dialects, and then move ahead in a phased manner to cover other languages and dialects,” Malhotra told me. He plans to open source the model in the next few months.

Hanooman series

Vishnu Vardhan, founder, Seetha Mahalaxmi Healthcare.


There is yet another BharatGPT that isn’t related to CoRover. Called the BharatGPT group, it is led by Indian Institute of Technology (IIT) Bombay and seven other engineering institutes. Along with Seetha Mahalaxmi Healthcare (SML), a private healthcare company, they plan to release ‘Hanooman’ soon. That’s a suite of Indic language models. The models will cover Hindi, Tamil, and Marathi to begin with, and later expand to more than 20 languages.

Interestingly, Hanooman is supported by Reliance Industries and IT industry body Nasscom.

The Hanooman series of AI models has been built using what is called the ‘transformer’ architecture, which also underpins many well-known LLMs such as OpenAI’s GPT, Meta’s LLaMA, and Google’s Gemini. The original transformer follows an encoder-decoder structure, where an encoder accepts an input and a decoder generates an output (many modern LLMs, including GPT, use only the decoder half).
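That encoder-reads-input, decoder-emits-output flow can be sketched in a few lines of toy Python. A lookup table stands in for the neural network here, so this only illustrates the shape of the data flow, not real attention layers; the word table is invented for the example.

```python
# Toy encoder-decoder sketch (illustrative only; real transformers use
# stacked attention layers, not a lookup table).

# Invented word-level "translation" table for the example.
TABLE = {"mujhe": "I need", "ticket": "a ticket", "chahiye": ""}

def encode(words):
    # A real encoder produces one contextual vector per input token;
    # here we just tag each word as an internal representation.
    return [("repr", w) for w in words]

def decode(encoded):
    # A real decoder attends to the encoder output and generates
    # tokens step by step; here we emit one piece per input word.
    out = []
    for _, word in encoded:
        piece = TABLE.get(word)
        if piece:  # skip words the toy table maps to nothing
            out.append(piece)
    return " ".join(out)

print(decode(encode(["mujhe", "ticket", "chahiye"])))  # -> I need a ticket
```

The point is only the division of labour: the encoder consumes the whole input before the decoder starts producing output.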


“It cost us about $20 million to build the first model. We will release it this month,” Vishnu Vardhan, founder of SML, told me. He hopes to release “at least four models” under the Hanooman series by the end of March.

Hanooman, according to Vardhan, will initially be a 40 billion parameter foundational model (models that are trained on a broad set of data and can be used for different tasks), atop which all the other models in the series will be built. He plans to open source this foundation model for researchers, academic institutions and startups. SML, Vardhan further told me, is also working with businesses to create smaller models, which will be monetized. One of the first customized versions will be a model fine-tuned for healthcare, one that is trained using medical data.

Airawata arrives

Apart from IIT Bombay, other premier tech institutes have upped their AI game, too.

In 2022, IIT Madras established the Nilekani Center at AI4Bharat, a research lab, to promote Indian language technology. The lab, supported by Rohini and Nandan Nilekani through Nilekani Philanthropies, released ‘Airawata’, an LLM trained on Hindi datasets, in January.

The lab has also partnered with Sarvam AI, a GenAI startup founded by Vivek Raghavan and Pratyush Kumar—both were co-founders of AI4Bharat—to develop LLMs specifically for India, called the OpenHathi series. Sarvam AI, for its part, says it will work with Indian enterprises to co-build domain-specific AI models on their data. It also hopes to use GenAI atop the India stack—Aadhaar, Unified Payments Interface (UPI), etc.—for public applications. “Every enterprise will be impacted by GenAI. Our intent is to work both in the applications space by building GenAI apps on our platform and also build production grade voice-to-voice LLMs this year,” Raghavan told me over the phone. “It will be a model that anyone can use as a service.”

Meanwhile, Bengaluru-based AI and Robotics Technology Park (Artpark), a non-profit promoted by the Indian Institute of Science, is partnering with Google India to launch an LLM called Project Vaani. While Google plans to collect speech samples from 773 districts, the initiative is currently focused on 80 districts across 10 states. Cloud-based communications startup Ozonetel, too, is in the fray: along with Swecha Telangana (Swecha works on bridging the digital divide), it is compiling a Telugu stories dataset aimed at building a Telugu LLM. About 8,000 students from 20 colleges participated, creating 40,000 pages of Telugu content.

 

Krutrim’s claims

In December last year, Bhavish Aggarwal, the founder of Ola Cabs and Ola Electric, announced yet another venture—Krutrim AI.

Aggarwal went on to make several claims about Krutrim, which means ‘artificial’ in Sanskrit. It is “India’s first full-stack AI” solution; it is a GenAI foundational model, built from scratch; it is trained on more than two trillion tokens and is comparable to GPT-4, created by OpenAI; it can understand 20 Indian languages and generate content in 10 Indian languages including Marathi, Hindi, Telugu, Kannada, and Odia.

GPT-4, however, has reportedly been trained on more than 13 trillion tokens. OpenAI describes it as a large multimodal model that “while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks”.

Bhavish Aggarwal unveiled Krutrim on 16 December 2023. (OLA)

The ‘Krutrim beta’ version was released on 26 February. Before I could try out the platform, I had to read a disclosure: “Krutrim is continuously learning and evolving with every conversation; always validate important results independently as Krutrim may display inaccurate, harmful or biased information; Krutrim is not equipped to provide advice on sensitive topics. Please consult a professional for critical decisions.”

Krutrim next nudged me to create, learn and discover. When asked questions, it mostly answers in bullet points, and some of the responses are fairly accurate. When I asked ‘Tell me about LiveMint’, the bot responded: “LiveMint is a premium business news publication in India, known for its in-depth reporting and analysis of national and international business news.” These are early days, but many users have been dismissive of Krutrim and believe that Aggarwal has released a half-baked product. For instance, Santanu Bhattacharya, a former Nasa scientist and visiting academic at MIT, posted on X: “Sad affairs at #Indian #startups, where gimmicks like ‘fastest #unicorn’ far overshadow even getting basic things right. #KrutrimAI…fails in basic questions like ‘winner of Cricket World Cup’.”

 


Expensive and scarce

India-specific LLMs are certainly the need of the hour, but the task, as we mentioned earlier, is easier said than done, given high computing costs and a paucity of good Indian datasets. Many of the 22 official Indian languages have little or no digital data, which makes it challenging to build and train an AI model on local datasets.

Bhashini, a unit of the National Language Translation Mission, has so far spent $6-7 million to collect data from different sources, according to its CEO Amitabh Nag. Bhashini has also employed more than 200 people to collect data—text as well as speech—and feed it into the system, following which the data is curated, annotated, and labelled.

Most of Nvidia’s H100s—the market’s most potent GPU chip tailored for AI—have reportedly been cornered by big tech companies. (Getty Images)

According to Malhotra, Tech Mahindra acquires data from various online sources, including Common Crawl, which provides website data. “However, the challenge lies in finding dialect-specific data, as most sites primarily offer data in mainstream languages,” he said.

To address this, Tech Mahindra has established projectindus.in, a portal where people can contribute data in various dialects. Even when you have the data, GenAI systems need to handle what is called ‘hallucination’—generating false or incorrect information. Biases need to be continuously measured, monitored and fixed.

Tech Mahindra has employed a team of people to annotate the data and remove such biases. It worked on a classification model outlining nine broad biases, including those pertaining to crime, political views, age, and disabilities. “You cannot do this job because of your age. This is an age-based bias,” Malhotra explained to me.
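As a hypothetical illustration of what such annotation support might look like (this is not Tech Mahindra’s actual classifier; the categories and keyword lists below are invented), a simple keyword-based flagger could surface candidate sentences for human reviewers:

```python
# Hypothetical bias flagger: surfaces sentences matching a few broad
# bias categories so human annotators can review them. Categories and
# phrases are made up for illustration.

BIAS_KEYWORDS = {
    "age": ["because of your age", "too old", "too young"],
    "disability": ["handicapped", "crippled"],
    "political": ["all politicians are"],
}

def flag_biases(sentence):
    """Return the bias categories whose phrases appear in the sentence."""
    text = sentence.lower()
    return sorted(
        category
        for category, phrases in BIAS_KEYWORDS.items()
        if any(phrase in text for phrase in phrases)
    )

print(flag_biases("You cannot do this job because of your age."))  # -> ['age']
print(flag_biases("The weather is nice today."))                   # -> []
```

A real pipeline would use a trained classifier rather than keywords, but the workflow, flag first, then have humans annotate and fix, is the same.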

When Project Indus started, Tech Mahindra had almost 200 GB of data. But after the company began cleaning the data, and removing the biases, it was left with only about 114 GB, Malhotra further said.


Indian innovators face yet another challenge: the steep cost of running GenAI systems. Rowan Curran, an analyst at Forrester, estimates the hardware costs of running GPT-3, released in 2020, at between $100,000 and $150,000 a month.

This excludes other costs such as electricity, cooling and backup. OpenAI’s GPT was in the works for more than six years, cost upwards of $100 million, and used an estimated 10,000 graphics processing units (GPUs). Finally, even GPUs are in short supply today. Most of Nvidia’s H100s—the market’s most potent GPU chip tailored for AI—have reportedly been cornered by big tech companies like Google, Microsoft, and Meta.

“India does not have the H100s, which pose a major computing challenge,” Vishnu Vardhan of SML told me.

Umakant Soni, CEO of Artpark, believes companies building LLMs will have to create a business model that recoups the money they invest. And while the GPU scarcity could ebb over the next couple of years as supply improves, other expenses will keep climbing.