Mint Explainer: The mercurial rise of India-focused LLMs

Aggarwal, thus, joins the growing ranks of Indian companies and institutions that are building large language models (LLMs) trained on Indian languages. These include Bhashini–a unit of the national language translation mission by the ministry of electronics and information technology (MeitY); Tech Mahindra’s Indus project; AI4Bharat at IIT-Madras; Project Vaani–part of the Bhasha AI project of ARTPARK and the Indian Institute of Science’s pan-India language initiatives; Sarvam AI’s OpenHathi series; and CoRover.ai’s BharatGPT.

Generative AI, or GenAI, refers to the ability of LLM-powered chatbots such as ChatGPT to create new content, including audio, code, images, text, simulations, and videos (hence the term, multimodal). GenAI systems fall under the broad category of machine learning, but unlike traditional ML that can analyse data patterns to make predictions, these systems create entirely new content with the help of ‘prompts’.

That said, can Ola’s Aggarwal do a Google Gemini or OpenAI’s GPT-4? And why is the group behind electric-vehicle maker Ola Electric and ride-sharing startup Ola Cabs dabbling in foundational models, data centres, and silicon chips that require a lot of investment?

What’s Krutrim got to do with Ola?

Aggarwal’s Krutrim announcement comes at a time when the government is set to unveil its AI policy under the India AI programme on 10 January, which will include a policy framework for public-private partnership models on development of AI databases in Indic languages, as well as indigenous compute capacities, according to Union minister of state for IT, Rajeev Chandrasekhar. But the release of Krutrim’s base foundation model also comes at a time when Ola Electric is gearing up to file for an IPO.

Backed by SoftBank, Ola Electric is targeting a valuation of $7-8 billion by early 2024. While that figure’s much higher than the company’s current estimated worth of about $3.6 billion, it’s closer to Ola Electric’s estimated valuation of $7.3 billion as at the end of 2021. 

Ola Electric plans to use the funds raised from the IPO for expanding its electric vehicle business and establishing a dedicated lithium-ion cell manufacturing unit.

Aggarwal has clarified that Krutrim is a “separate business altogether”, and will not “be integrated at a transactional level”. 

“There are some entities that I own 100%—this is under my company, and not part of Ola or Ola Electric’s corporate structure,” he said. Aggarwal did say Krutrim had “some investments into (Ola Electric)”, but did not disclose any details. 

Further, in a presentation, Aggarwal said that all Ola group companies were “already using Krutrim for a lot of their internal workloads, be it customer support, voice and chat, customer sales calls, and for other processes…” 

This clearly implies that Krutrim’s products and services will be cross-sold to enhance the offerings of the group companies.

How’s GenAI used in vehicles?

The use of generative AI in the auto sector is not new. Mercedes-Benz, for instance, recently used ChatGPT to power voice assistants in a beta program available to more than 900,000 vehicles.

Also consider the example of a Formula E electric race car, the GENBETA, an enhanced GEN3 race car. The GEN3 is the fastest and lightest electric race car, with a top speed of more than 322 kmph, and is used by the 11 teams and 22 drivers in the ABB FIA Formula E World Championship.

Google Cloud provided generative AI to analyse the drivers’ runs. Additionally, experts from McKinsey & Co.’s AI arm, called QuantumBlack, built data and analytics components to create the driver interface that analysed and queried data in real-time using generative AI.

According to Nvidia, generative AI is also enabling new breakthroughs in autonomous vehicle development in research areas including the use of neural radiance field technology to turn recorded sensor data into fully interactive 3D simulations. These digital twin environments, as well as synthetic data generation, can be used to develop, test and validate autonomous vehicles at incredible scale.

Aggarwal’s AI ambitions, however, appear to go far beyond just the auto sector, given that the Ola group’s businesses extend beyond mobility to financial services, including payment systems and insurance, as well as cloud kitchens.

What’s the plan with Krutrim?

Krutrim’s AI model, according to the company, has been trained on more than 2 trillion tokens (loosely, numerical representations of words and pieces of words that an LLM can process; for instance, banana may be a single token, while homework may be split into two, home and work). While Aggarwal compared Krutrim to GPT-4, the latter has been trained on more than 13 trillion tokens.
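To make the idea of tokens concrete, here is a minimal sketch in Python. It is a toy illustration only: the vocabulary, the ID numbers and the function names are hypothetical, and it does not represent how Krutrim or GPT-4 actually tokenise text. Production tokenisers learn their vocabulary from large volumes of data rather than using a hand-written list.

```python
# Toy illustration only: real LLM tokenisers learn their vocabulary from data.
# This vocabulary and its numeric IDs are hypothetical.
VOCAB = {"banana": 0, "home": 1, "work": 2, "<unk>": 3}


def tokenize(word: str) -> list[str]:
    """Greedy longest-match split of a word into known sub-word pieces."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: mark one character as unknown.
            pieces.append("<unk>")
            i += 1
    return pieces


def encode(word: str) -> list[int]:
    """Map each sub-word piece to its numeric ID, which is what a model sees."""
    return [VOCAB[piece] for piece in tokenize(word)]


print(tokenize("banana"), encode("banana"))      # one piece, one ID
print(tokenize("homework"), encode("homework"))  # two pieces, two IDs
```

In this sketch, banana maps to a single token while homework is split into home and work, each with its own number. Training-set sizes such as 2 trillion tokens count these pieces, not whole words.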

That said, the strength of Krutrim may lie in its ability to understand 20 Indian languages and generate content in 10 of them, including Marathi, Hindi, Telugu, Kannada, and Odia. Aggarwal said Krutrim has been “trained on 20 times more Indic tokens than any other model, ensuring a deep understanding of Indian culture, values, and aspirations”.

While those who register for the base model at OlaKrutrim are currently put on a waitlist, Aggarwal plans to make the “whole platform” available for developers to build application programming interfaces, or APIs, for enterprise applications, in February. Ola also plans to launch Krutrim Pro in the next quarter.

Can Ola afford a Krutrim?

Building a foundational model from scratch is an expensive affair. OpenAI’s GPT was in the works for more than six years, cost upwards of $100 million, and used an estimated 30,000 graphics processing units (GPUs). Aggarwal has not disclosed any details of his investments in Krutrim, or its costs, so far.

In FY22, Ola derived about 61% of its revenue, or ₹1,208.6 crore, from its ride-hailing business in India, while posting a loss of ₹101 crore. Financial services comprised a small part of the revenue. The group posted a consolidated operating revenue of ₹1,970.4 crore in FY22, rising from ₹983.2 crore in the year before. Ola’s net losses, though, widened in FY22 to ₹1,522.33 crore from ₹1,116.6 crore in the previous year.

That said, since Krutrim is a separate business, Aggarwal may be bootstrapping the venture, given that he has a personal net worth of a little over $1.4 billion. One, however, will have to wait till Aggarwal discloses more details about his investment plans in designing silicon chips and building the LLM ecosystem.

How can GenAI work with regional languages?

The fact remains that even though India is home to more than 400 languages, making it one of the most linguistically diverse countries in the world, most foundation models and LLMs are trained primarily using internet data, which is predominantly English. As per Statista, English was the most popular language for web content, representing nearly 59% of websites as of January this year. Russian ranked second with 5.3% of web content, followed by Spanish with 4.3%.

While one can only laud the contribution of India’s Centre for Development of Advanced Computing (C-DAC) in developing the country’s multilingual ecosystem over the past three decades, AI models need to be trained on regional languages to bridge the digital divide in countries like India, which is why efforts such as Krutrim make a lot of sense.

Krutrim, on its part, says it will tap Bhashini, whose technology comprises automatic speech recognition, optical character recognition (OCR), natural language understanding, machine translation, and text-to-speech. The Bhashini platform, for instance, uses OCR to extract text from printed materials such as brochures to train AI models in 14 languages.

But getting local datasets is a challenge, according to the CEO of Bhashini, Amitabh Nag, who pointed out that many of the 22 official Indian languages do not have digital data, which makes it challenging to build and train an AI model. Bhashini has so far spent $6-7 million to collect data from different sources and employed more than 200 people to collect data (text as well as speech) and feed it into the system, following which the data is curated, annotated, and labelled.

What other Indic LLMs are in the works?

  • The ‘Nilekani Center at AI4Bharat’ (named after Nandan Nilekani), launched at the Indian Institute of Technology-Madras in July last year, is building open-source language AI for Indian languages, including datasets, models, and applications. The project is supported by EkStep Foundation, Microsoft’s Research Lab, and the India Development Center.
  • Sarvam AI, a generative AI startup founded by Vivek Raghavan and Pratyush Kumar (both co-founders of AI4Bharat), is developing LLMs specifically for India–the OpenHathi Series. The startup will focus on training AI models to support the diverse set of Indian languages and voice-first interfaces. It will work with Indian enterprises to co-build domain-specific AI models on their data, and also plans to use GenAI atop the India stack (Aadhaar, UPI, Account Aggregator, etc.) “specifically for public-good applications”. Sarvam AI is partnering with AI4Bharat, which has “contributed language resources and benchmarks”.
  • Bangalore-based AI and Robotics Technology Park (ARTPARK) and the Indian Institute of Science are partnering with Google India to launch a large language model called Project Vaani. This is part of the Bhasha AI project of ARTPARK and IISc’s pan-India language initiatives, which include SYSPIN (Synthesizing Speech in Indian languages) and RESPIN (Recognizing Speech in Indian languages). While Google plans to collect speech samples from 773 districts, the initiative is currently focused on 80 districts in 10 states. It is expected to expand over the next couple of years, with over 150,000 hours of curated speech and 100 million sentences of text in Indian scripts.
  • Cloud-based communications startup Ozonetel, too, recently partnered with Swecha Telangana at the Indian Institute of Information Technology-Hyderabad to compile a Telugu stories dataset, aimed at building a Telugu LLM. About 8,000 students from 20 colleges participated to create 40,000 pages of Telugu content.
  • CoRover has launched its own indigenous LLM called BharatGPT, which is available in more than 12 Indian languages in partnership with Bhashini. CoRover Pvt. Ltd currently offers AI virtual assistants (chatbots, voicebots, videobots) to organisations including IRCTC, LIC, the Indian Navy (GRSE), Max Life Insurance, and NPCI. The company is hosted on the Google Cloud Platform (GCP), and Google’s Vertex AI is integrated with CoRover’s conversational AI platform, allowing organisations to utilise Google’s AI services.
  • And in another effort in the auto sector, the Mahindra Group said in August that it aimed to construct an indigenous LLM specifically designed to converse in a multitude of Indic languages. In the first phase, the Indus Project targets the inclusion of a remarkable 40 Hindi dialects, paving the way for an ever-expanding roster. Tech Mahindra acknowledges it has “drawn inspiration from ‘Bhashini’… to amass datasets on Indic languages”.