After seemingly lurking on the sidelines most of last year, Apple is starting to shake things up in the field of artificial intelligence—and open-source AI in particular.
The Cupertino-based tech giant has partnered with the University of California, Santa Barbara to develop an AI model that can edit images based on natural language instructions, the same way people interact with ChatGPT. Apple calls it Multimodal Large Language Model-Guided Image Editing (MGIE).
MGIE interprets text instructions provided by users, processing and refining them to generate precise image editing commands. Integrating a diffusion model enhances the process, enabling MGIE to apply edits based on the characteristics of the original image.
Multimodal Large Language Models (MLLMs), which can process both text and images, form the foundation of the MGIE method. Unlike traditional single-modality AIs that focus solely on text or images, MLLMs can process complex instructions and work in a wider range of situations. For example, a model may parse a text instruction, analyze the elements of a specific photo, and then produce a new picture with an unwanted element removed.
To perform these actions, an AI system must combine several capabilities in a single process: text generation, image generation, segmentation, and CLIP-style image-text analysis.
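The flow described above can be sketched as two chained stages. To be clear, every function here is an illustrative stand-in written for this article, not Apple's actual API: a real system would put an MLLM behind `refine_instruction` and a diffusion model behind `edit_image`.

```python
# Toy sketch of an MGIE-style pipeline. All functions are illustrative
# placeholders, not code from Apple's MGIE repository.

def refine_instruction(user_text: str) -> str:
    """Stand-in for the MLLM step: expand a terse user request
    into an explicit, concrete editing instruction."""
    refinements = {
        "make the sky dramatic": "replace the sky with dark storm clouds",
    }
    return refinements.get(user_text, user_text)

def edit_image(image: dict, instruction: str) -> dict:
    """Stand-in for the diffusion step: apply the refined instruction
    while preserving the original image's other characteristics."""
    edited = dict(image)  # keep original traits untouched
    edited["applied_edits"] = image.get("applied_edits", []) + [instruction]
    return edited

photo = {"subject": "landscape", "applied_edits": []}
result = edit_image(photo, refine_instruction("make the sky dramatic"))
print(result["applied_edits"])  # ['replace the sky with dark storm clouds']
```

The point of the two-stage design is that the user's terse request is first made explicit by the language model, and only the refined instruction reaches the image model.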
The introduction of MGIE brings Apple closer to achieving capabilities akin to OpenAI's ChatGPT Plus, enabling users to engage in conversational interactions with AI models to create customized images based on text input. With MGIE, users can provide detailed instructions in natural language—"remove the traffic cone from the foreground"—which is translated into image editing commands and executed.
In other words, users can start with a photo of a blonde person and turn them into a ginger just by saying, "make this person a redhead." Under the hood, the model would understand the instruction, segment the person's hair, generate a command like "red hair, highly detailed, photorealistic, ginger tone," and then execute the changes via inpainting.
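Conceptually, that redhead edit traces through three steps: segment, expand the prompt, inpaint. The sketch below walks those steps with hypothetical placeholder functions; none of the names come from MGIE itself.

```python
# Conceptual trace of the "make this person a redhead" edit described
# above. Every function is a hypothetical stand-in, not MGIE code.

def segment(image: dict, region: str) -> dict:
    """Stand-in for segmentation: isolate the region to be edited."""
    return {"image": image, "mask": region}

def expand_prompt(instruction: str) -> str:
    """Stand-in for the MLLM turning a terse instruction
    into a detailed diffusion prompt."""
    return "red hair, highly detailed, photorealistic, ginger tone"

def inpaint(masked: dict, prompt: str) -> dict:
    """Stand-in for diffusion inpainting: regenerate only the
    masked region according to the prompt."""
    edited = dict(masked["image"])
    edited[masked["mask"]] = prompt
    return edited

portrait = {"hair": "blonde"}
edited = inpaint(segment(portrait, "hair"),
                 expand_prompt("make this person a redhead"))
print(edited["hair"])  # red hair, highly detailed, photorealistic, ginger tone
```

Because only the masked region is regenerated, the rest of the portrait is left untouched, which is what lets the edit stay faithful to the original photo.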
Apple's approach aligns with existing tools like Stable Diffusion, which can be augmented with a rudimentary interface for text-guided image editing. Leveraging third-party tools like InstructPix2Pix, users can interact with the Stable Diffusion interface using natural language commands and see the effects on edited images in real time.
Apple's researchers report, however, that MGIE is more accurate than comparable instruction-based editing methods.
Besides generative edits, Apple's MGIE can perform conventional image editing tasks like color grading, resizing, rotation, style changes, and sketching.
Why would Apple make it open source?
Apple's open-source forays are a clear strategic move—with a scope beyond mere licensing requirements.
To build MGIE, Apple uses open-source models such as LLaVA and Vicuna. Due to the licensing requirements of these models, which limit commercial use by big corporate entities, Apple was likely compelled to share its improvements openly on GitHub.
But open-sourcing also allows Apple to leverage a worldwide pool of developers to harden and extend the model. This kind of collaboration moves things forward far faster than Apple working entirely on its own from scratch. In addition, this openness invites a wider spectrum of ideas and draws diverse technical talent, allowing MGIE to evolve faster.
Apple's engagement with the open-source community through projects like MGIE also gives the brand a boost among developers and tech enthusiasts. This strategy is no secret: Meta and Microsoft are both heavily investing in open-source AI.
It's possible that releasing MGIE as open-source software will give Apple a head start in setting still-evolving industry standards for AI and AI-based image editing in particular. With MGIE, Apple has likely given AI artists and developers a solid foundation with which to build the next big thing, providing more accuracy and efficiency than what's available elsewhere.
MGIE will certainly make Apple's products better: it wouldn't be too difficult to synthesize a voice command sent to Siri and use that text to edit a photo on the user's smartphone, computer, or immersive headset.
Technically savvy AI developers can use MGIE right now. Just visit the project’s GitHub repository.
Edited by Ryan Ozawa.