ChatGPT's growth has been dramatic since its launch. Recently, OpenAI announced that ChatGPT can now hear, see and speak.
The multimodality of ChatGPT has taken a new form.
In November 2022, OpenAI's ChatGPT appeared on the internet. Two months later, with over 100 million users, it attained the title of the fastest-growing consumer software application in history. The once-nonprofit company saw the opportunity to make a profit, so it did.
The profits came from their freemium service, but most of that money went straight into paying their bills, thanks to the resource-hungry demands of large language models.
On March 14, 2023, the launch of GPT-4 cemented OpenAI's name in the superintelligence utopia, making it a key player in extending the boundaries of AI and NLP technology even further.
Other big companies showed interest too, and everybody started pushing this boundary further. At the same time, most of these tech companies made hefty profits from this revolutionary field of AI.
ChatGPT, kept on life support by billions of dollars from companies like Microsoft, can finally see, hear and talk.
Metaphorically, it is alive.
I. Voice: When ChatGPT Speaks
Watch this demo video by OpenAI, in which they reveal the new multimodal features inside the ChatGPT app:
This looks like a "Hello World" moment for ChatGPT — and it is alive, thanks to its new multimodal upgrade.
Through voice, users can speak their instructions to ChatGPT, which then responds in a seemingly natural voice. The new voice feature effectively promotes ChatGPT to a voice assistant, and a powerful one at that.
“We collaborated with professional voice actors to create each of the voices. We also use Whisper … to transcribe your spoken words into text,” said OpenAI in their announcement post.
Whisper is a speech recognition system by OpenAI, trained on 680,000 hours of multilingual audio data.
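For readers curious about the plumbing, here is a minimal sketch of what speech-to-text with Whisper looks like, using OpenAI's open-source whisper Python package. The model size ("base") and the file name audio_note.mp3 are illustrative assumptions; OpenAI hasn't said which Whisper configuration the ChatGPT app actually runs.

```python
# Minimal sketch: transcribing spoken audio to text with OpenAI's open-source Whisper.
# Assumes `pip install openai-whisper` and ffmpeg installed; "audio_note.mp3" is a placeholder file.
import whisper

# Load one of the published checkpoints ("base" trades accuracy for speed).
model = whisper.load_model("base")

# Transcribe the spoken words into text, the same job Whisper does for ChatGPT's voice input.
result = model.transcribe("audio_note.mp3")
print(result["text"])
```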
In the demo shared by OpenAI, the user asks the ChatGPT app to tell a bedtime story about a hedgehog, and it responds by telling one aloud. It sounds just like ChatGPT, literally, and as reported by ZDNet, the experience is similar to how voice assistants like Amazon's Alexa function.
As a matter of fact, rumour has it that Amazon is planning to integrate generative AI, like GPT-4, into Alexa to make its voice assistant more reliable and, well, smart.
II. Image: When AI Sees
In the demo by OpenAI, the user asked ChatGPT to help fix their bike by sending images of the bike to the app. ChatGPT ‘looked’ at those images and came up with a solution to fix the bike.1
Things got interesting when ChatGPT was able to correlate the instruction manual with the available tools and guide the user on how to actually fix the bike.2
The image input feature can be helpful in many different situations: identifying objects, solving a math problem, reading an instruction manual, or (of course) fixing a bike. The ability to see images makes ChatGPT far more capable at tasks that require visual analysis.
One interesting application of this feature comes from a Danish startup called Be My Eyes.
Be My Eyes has been creating technology for the more than 250 million people who are blind or have low vision since 2012. They are using GPT-4 to aid these differently-abled people, developing a GPT-4-powered Virtual Volunteer™ within their app.
This allows the Be My Eyes app, which already assists blind people with their day-to-day challenges, to become better and more reliable.
According to OpenAI, Be My Eyes can benefit a lot of users, as they can now interact with an AI assistant that, thanks to the image capability, helps them understand their surroundings well.
“Image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images”, says OpenAI in its blog post.
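To give a rough sense of what image understanding looks like from a developer's seat, here is a minimal sketch using OpenAI's Python SDK with a vision-capable chat model. The model name gpt-4-vision-preview, the example image URL, and the prompt are all assumptions for illustration; API access to vision models may differ from what the ChatGPT app itself uses.

```python
# Minimal sketch: asking a vision-capable GPT-4 model about an image via the OpenAI Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY set in the environment.
# The model name and image URL below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How do I adjust the seat height on this bike?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/bike.jpg"},  # placeholder image
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```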
III. Safety: When ChatGPT (Tries to) Become Safe
OpenAI conducted beta testing and "red teaming" to explore and mitigate risks.
This allows ChatGPT to be nearly safe, if not completely.
Not too long ago, OpenAI published a paper describing their testing efforts with GPT-4V. GPT-4V, short for GPT-4(V)ision, is the version of GPT-4 that can analyze image inputs provided by the user.
The primary goal, in OpenAI’s own words, was to "gain additional feedback and insight into the real ways people interact with GPT-4V."
The paper gives us a taste of the risks that come with the multimodal nature of GPT-4.
OpenAI's positive evaluation shows that ChatGPT was able to avoid harmful content. It seems to refuse to generate AI images that includes real people. Moreover, the GPT4-V also refused to identify people in images.
However, the negative evaluations show that GPT-4V can still be made to generate disinformation, break CAPTCHAs, or geolocate images.
Building on top of that, OpenAI writes in the highlights of its GPT-4V(ision) system card report:

“…Tasks such as the ability to solve CAPTCHAs indicate the model’s ability to solve puzzles and perform complex visual reasoning tasks. High performance on geolocation evaluations demonstrate world knowledge the model possesses and can be useful for users trying to search for an item or place.”
Thanks to AI, gone are the days of CAPTCHAs.
OpenAI also reported one interesting finding: GPT-4V is quite good at refusing image-based "jailbreaks."
In this context, image-based jailbreaking refers to tricking a multimodal model into bypassing its built-in limitations or restrictions by hiding the adversarial prompt inside an image, for example as a screenshot of text.
It is a form of hacking (more of tricking) these models into producing restricted outputs, either by exploiting their flaws or by manipulating their inputs.
From the graph below, shared by OpenAI, we see how GPT-4 was able to achieve this jailbreak refusal, with a refusal rate of more than 85%.

The graph compares three variations of GPT-4: GPT-4 Release, GPT-4V, and GPT-4V + Refusal System.3
OpenAI also engaged "red teams" to test the model's abilities in scientific domains, such as understanding images in publications, and its ability to provide medical advice given medical images such as CT scans.
So is this reliable? Of course not.
OpenAI's conclusion on this is clear: “We do not consider the current version of GPT-4V to be fit for performing any medical function.”
So the image capability isn’t fully reliable yet. However, it’s a big leap nonetheless.
OpenAI in its blog mentioned that these new features would come slowly — citing safety concerns.
IV. Where Are We Landing on the AGI Dreams?
OpenAI's latest additions to ChatGPT are nothing short of remarkable. Multimodality is the path OpenAI has to take if it wants to achieve AGI.
Whether it will achieve AGI or not is up for debate. How do we even know if AGI is here? Frankly, that isn't clear to many AI experts themselves.
But in loose terms, we do know what AGI is: Artificial General Intelligence (AGI) is a theoretical term for an AI that is on par with humans in terms of cognitive abilities.
There is one difficulty though: there is no way to pinpoint a certain moment in the future when we can say that AGI has been achieved.
But taking cues from the past, it seems like anytime a computer outsmarts a human, we get closer to AGI.
Deep Blue beat Kasparov at chess: AGI is near. AlphaGo beat the world Go champion: AGI is near. AI started to outperform humans on various aptitude tests: AGI is near.
AI now seems to outperform humans when it comes to creativity, and everybody seems to believe AGI is near.
However, AGI drifts further away whenever we find a fault within these AI systems. Hallucination, misinformation, and bias; you know the list. Even when we have the largest and strongest AI models, these caveats form a roadblock on our supposed AGI journey.
To our annoyance, many point out that these shortcomings of AI are fundamental and intrinsic, with no cure.
However, quite interestingly, there are some instances where humans don't look too bad in front of AI after all.
The widely circulated report which said AI outperformed humans on creativity tests didn't actually show a significant outperformance: AI was certainly on par, but not the best most of the time. The story is even more interesting in the Go case. In a dramatic display of ‘revenge’, Kellin Pelrine, an American research scientist intern at FAR AI, defeated KataGo, a superhuman AlphaGo-style Go engine, by apparently exploiting a weakness in the system.
I feel multimodality is the way to go if our destination is AGI. And even if we can't achieve AGI in the near future, we might at least get close to it.
The integration of voice input and output, image recognition, and a commitment to safety leads to a ChatGPT that is continuously evolving, becoming a more versatile and reliable AI assistant. The ability to make inferences by analysing its surroundings is very close to how humans learn too.
These features open up a world of possibilities, from hands-free interaction to solving visual problems.
Moreover, ChatGPT will soon be capable of searching the internet from inside the ChatGPT window.4 These features will soon be available to all users and developers; according to OpenAI, it will roll them out slowly, with ChatGPT Plus and Enterprise users as the priority.
The browsing functionality, though currently only available to Plus and Enterprise users, will reach all users pretty soon, according to a statement by OpenAI.
If multimodality is the path that we are all walking on, then it's safe to assume that AGI is near.
Just waiting for the day when people say, “See! AI can take the jobs of mechanics.”
3. GPT-4 Release is the original public version of GPT-4. GPT-4V is the vision-enabled version of GPT-4, which went through additional safety training. GPT-4V + Refusal System is GPT-4V with an additional layer of protection that can detect and reject harmful requests.
4. However, this isn't something entirely new, as you could already use GPT-4 with web access, either through plugins or through Bing AI Chat.