OpenAI has rolled out highly anticipated upgrades that will allow its popular ChatGPT chatbot to interact with images and voices. This launch represents a major step towards OpenAI’s vision for artificial general intelligence that can perceive and process information from multiple modes, not just text.

"We are beginning to roll out new voice and image capabilities in ChatGPT. They offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you’re talking about," OpenAI said in its official blog post.

https://youtu.be/--khbXchTeE?si=vx3ne9oRgzvJV6ZA

OpenAI said the new ChatGPT Plus will include voice chat powered by a novel text-to-speech model capable of mimicking human voices, and the ability to discuss images thanks to integration with the company’s image generation models. The new features seem to be part of what is known as GPT Vision (or GPT-V, which is often confused with a theoretical GPT-5) and represent key components of the enhanced multimodal version of GPT-4 that OpenAI teased earlier this year.

This upgrade comes right after OpenAI unveiled DALL-E 3, its most advanced text-to-image generator yet. Hailed as “insane” by early testers for its quality and accuracy, DALL-E 3 can create high-fidelity images from text prompts while understanding complex context and concepts expressed in natural language. It will be built into ChatGPT Plus, a subscription-based service that offers a version of ChatGPT powered by GPT-4.

The integration of DALL-E 3 and conversational voice chat signifies OpenAI’s push towards AI assistants that can perceive the world more like humans do, with multiple senses. According to the company: “Voice and image give you more ways to use ChatGPT in your life. Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it.”

Microsoft Fuels the AI Race with OpenAI Integration

OpenAI’s largest backer, Microsoft, is also charging ahead with integrating OpenAI’s advanced generative AI capabilities into its own consumer products. At its recent autumn event, Microsoft announced AI upgrades to Windows 11, Office, and Bing search that draw on models like DALL-E 3 (in image-tweaking programs like Microsoft’s revamped Paint) and on Copilot, Microsoft’s AI assistant built on OpenAI’s models.

This aligns with Microsoft’s investment of more than $10 billion in OpenAI, as it aims to lead the AI assistant race. The debut of Copilot in Windows 11 on September 26 promises to make AI help available across Microsoft’s platforms and devices. Meanwhile, Microsoft 365 Chat applies OpenAI’s natural language prowess to automate complex work tasks.

As previously reported by Decrypt, Microsoft said that “Microsoft 365 Chat combs across your entire universe of data at work, including emails, meetings, chats, documents and more, plus the web.”

Cautious Steps Towards Responsible AI

However, OpenAI is keenly aware of potential risks with more powerful multimodal AI systems involving vision and voice generation. Impersonation, bias and reliance on visual interpretation are key concerns.

“OpenAI’s goal is to build AGI that is safe and beneficial,” the company wrote in its announcement. “We believe in making our tools available gradually, which allows us to make improvements and refine risk mitigations over time while also preparing everyone for more powerful systems in the future.”

Also, as Decrypt previously reported, OpenAI is assembling a red team to work on ways to prevent harmful consequences due to improper use of its AI products. CEO Sam Altman has also been lobbying around the world for favorable legislation.

OpenAI said that Plus and Enterprise users will have access to these new functionalities over the next two weeks, with plans to expand availability to developers afterwards. And with Google also announcing its own revolutionary multimodal LLM, Gemini, the race to dominate the AI industry is just beginning.
