Enterprise and Pro users can now show pictures and have voice conversations with ChatGPT

OpenAI announced it will soon roll out image recognition and speech synthesis features for ChatGPT. These features will be accessible to Enterprise and Pro users. Whether you need a bedtime story or help analyzing a chart, ChatGPT's newly expanded abilities will be here to assist.

Ellie Ramirez-Camara

ChatGPT will soon be capable of looking at pictures and talking back to Enterprise and Pro users. The addition of voice synthesis means users will be able to hold back-and-forth voice conversations with the popular chatbot. Speech recognition was already available in ChatGPT's mobile apps, so users could speak to the chatbot but would still receive a text-based reply to their query. Now, the mobile apps are getting the full set of voice features together with image recognition, which will also be available on all other platforms. According to OpenAI's announcement, the new features will be rolled out over the next two weeks.

Suggested everyday applications of the voice features include asking ChatGPT for a bedtime story or settling dinner debates. A promotional video showcases ChatGPT's storytelling capabilities, with the chatbot answering questions about the characters on the go. In simplified terms, voice synthesis is powered by a text-to-speech model that, according to OpenAI, can generate human-like speech from text and a few seconds of sample audio. OpenAI worked with voice actors to give ChatGPT a selection of voices. The announcement also includes a selection of texts and voices so readers can get a glimpse of the new features. On the speech recognition side, the chatbot relies on OpenAI's Whisper, which transcribes user audio input into text so it can be processed.

With image recognition, users can now ask ChatGPT about one or more images, or about a particular detail in an image. Suggested applications of this feature include taking pictures of your fridge to ask ChatGPT for help with dinner, troubleshooting simple technical issues such as raising a bicycle seat, and analyzing complex graphs. The mobile app now includes a highlighting tool so users can draw ChatGPT's attention to a specific part of an image.

The gradual roll-out responds, as is usual, to security concerns. OpenAI is aware that text-to-speech models able to produce natural-sounding speech from text and short sample audio have the potential for misuse, including impersonation and fraud. This possibility motivated OpenAI's decision to deploy the technology only in the controlled setting of Voice Chat for ChatGPT. Outside of the context of its chatbot, OpenAI says it is also collaborating with Spotify on a voice translation feature that would allow podcasters to reach a wider audience by translating their podcasts into different languages in their own voices.

The main concern with image recognition is the potential for hallucinations, in addition to the risk that users may rely solely on conversations with ChatGPT in what OpenAI calls "high-stakes domains". To mitigate this sort of risk, OpenAI has limited ChatGPT's ability to analyze and make direct statements about people. Beyond that, the company says it is being transparent about its models' limitations and discourages specialized use of the features without further verification. OpenAI also warns that although the model is proficient at transcribing English, it performs poorly with other languages, and it advises against using ChatGPT's audio recognition capabilities for them.

OpenAI's cautions and recommendations are frequently overshadowed by the hype generated by the marketing claim that "ChatGPT can now see, hear, and speak". Following OpenAI's announcement, Dr. Sasha Luccioni, an AI researcher at Hugging Face, posted on X:

It is undeniable that the "AI is the future" hype, together with the anthropomorphization of ChatGPT's new features, contributes to the misleading idea that ChatGPT can replace human expertise, especially in technical and "high-stakes domains". However, AI is also here to stay, and if ChatGPT's newest features deliver as promised, they will represent the next big leap forward, not only for OpenAI but for the next generation of AI-powered assistants across all domains.