This Open-Source Phone AI Agent Sees, Hears and Acts—All Without Touching the Cloud

Oppo's X-OmniClaw runs directly on your Android device, using the camera, screen, and microphone to execute real tasks inside real apps.

By Jose Antonio Lanz

Edited by Guillermo Jimenez

May 18, 2026

4 min read

Artificial intelligence. Image: Shutterstock/Decrypt


In brief

X-OmniClaw is an open-source Android AI agent from Oppo that keeps its core logic on-device and only calls the cloud for high-level reasoning.
The framework builds a long-term semantic memory from your photo gallery and session history, letting it act as a continuous assistant rather than a one-shot chatbot.
A behavior cloning feature lets users record a navigation path once so the agent can replay it instantly via Android deeplink, bypassing multi-step app navigation in future sessions.

Your phone already has a camera, a microphone, and a screen. It can see what you're looking at in real life and what's happening on its own display. Now the AI team at Chinese smartphone manufacturer Oppo has figured out that this hardware, which mostly sits underused, is exactly what you need to build a genuinely useful mobile AI agent.

That project is X-OmniClaw, published by the Multi-X Team. It's an open-source AI agent framework for Android that turns your phone into a hands-free, context-aware assistant capable of running real tasks across real apps, without routing everything through a cloud copy of your device.

Most mobile AI systems don't actually run on your phone. They run on cloud servers that host virtual copies of Android, letting an AI tap and scroll through apps remotely. The result: no access to your real camera, your actual photos, or your local files—just a stranger using a copy of your phone.

X-OmniClaw takes the opposite approach. Per the technical report, it introduces "an edge-native architecture that executes directly on the user's physical device, thereby eliminating the gap between simulated environments and real-world interaction contexts."

The report uses a car analogy: The smartphone is "the vehicle," X-OmniClaw is "the internal engine for control and perception," and the cloud-based language model is only called in as "the fuel" when heavy reasoning is needed. Everything else stays local.

How the Oppo AI phone agent works

X-OmniClaw's architecture rests on three pillars, Omni Perception, Omni Action, and Omni Memory, which work together as one continuous loop. Cloud LLMs are called in only for heavy reasoning, according to Oppo.

Oppo's X-OmniClaw Agent technology — Source: OPPO AI Center

Omni Perception covers everything the phone can sense. It combines camera feeds, screen content, and voice input into a single pipeline. A vision-language model interprets the scene before the agent does anything else. So if you point your camera at a bottle and ask, "how much does this cost?", the agent first figures out what you're looking at, then opens the relevant shopping app and starts searching. No guessing required.
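The "interpret first, act second" ordering can be sketched in a few lines of Python. This is a minimal illustration, not Oppo's code: the `Observation` fields, the function names, and the stub model and planner are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """Hypothetical bundle of the three input channels the article describes."""
    camera_frame: Optional[bytes] = None
    screen_text: Optional[str] = None
    voice_query: Optional[str] = None


def perceive_then_plan(
    obs: Observation,
    describe_scene: Callable[[Observation], str],
    plan_actions: Callable[[str, str], List[str]],
) -> List[str]:
    """Interpret the scene before planning any action."""
    if obs.voice_query is None:
        return []  # nothing was asked, so nothing to do
    scene = describe_scene(obs)  # stands in for the vision-language model
    return plan_actions(scene, obs.voice_query)


# Stubs standing in for a real model and planner (assumptions for the demo).
def fake_vlm(obs: Observation) -> str:
    return "water bottle"


def fake_planner(scene: str, query: str) -> List[str]:
    return ["open_shopping_app", "search:" + scene]


steps = perceive_then_plan(
    Observation(camera_frame=b"raw-bytes", voice_query="how much does this cost?"),
    fake_vlm,
    fake_planner,
)
```

Passing the model and planner in as callables keeps the sketch independent of any particular vision-language model.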

Omni Memory is what separates X-OmniClaw from a one-shot chatbot. The agent maintains context across tasks, app switches, and sessions. It also builds a long-term semantic memory from your photo gallery, turning raw images into structured notes about objects, scenes, and events. The report states that "runtime continuity is what lets X-OmniClaw operate as an ongoing device agent rather than a one-shot response system."
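To make the idea of "structured notes" concrete, here is a rough Python sketch of what a gallery memory could look like. The `PhotoNote` fields and the keyword search are assumptions for illustration; the report does not specify this schema, and a real system would more likely use embeddings than exact keyword matching.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PhotoNote:
    """Hypothetical structured note derived from one gallery image."""
    path: str
    objects: Set[str] = field(default_factory=set)
    scene: str = ""
    event: str = ""


class GalleryMemory:
    """Toy long-term memory: stores notes and finds them by keyword."""

    def __init__(self) -> None:
        self._notes: List[PhotoNote] = []

    def add(self, note: PhotoNote) -> None:
        self._notes.append(note)

    def search(self, keyword: str) -> List[str]:
        """Return photo paths whose objects, scene, or event mention the keyword."""
        kw = keyword.lower()
        return [
            note.path
            for note in self._notes
            if kw in {obj.lower() for obj in note.objects}
            or kw in note.scene.lower()
            or kw in note.event.lower()
        ]


memory = GalleryMemory()
memory.add(PhotoNote("/gallery/img_001.jpg", {"parrot", "cage"}, "living room"))
memory.add(PhotoNote("/gallery/img_002.jpg", {"dog"}, "park", "birthday picnic"))
memory.add(PhotoNote("/gallery/img_003.jpg", {"Parrot"}, "zoo aviary"))
parrot_photos = memory.search("parrot")
```

A query like this returns a list of file paths, which is the kind of result an agent could pass on to a later step such as batch-selecting photos in an editor.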

Omni Action handles execution. It combines XML interface data with an on-device visual model and OCR, a character-recognition layer, to figure out exactly what to tap, even on ad-heavy screens where structure alone isn't enough. It also includes behavior cloning: record yourself navigating to a buried app page once, and the agent can replay that route instantly using an Android deeplink shortcut next time.
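The sketch below shows one way the "structure first, pixels as backup" idea and the deeplink shortcut could fit together. It assumes the XML resembles a standard Android UI hierarchy dump (`node` elements with `text`, `clickable`, and `bounds` attributes) and that OCR results arrive as text plus a bounding box. The function names and the example URI are hypothetical, and the fallback order is an assumption rather than X-OmniClaw's documented behavior.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# Bounds format used in Android UI hierarchy dumps, e.g. "[0,0][100,50]".
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")

OcrBox = Tuple[str, Tuple[int, int, int, int]]  # (text, (x1, y1, x2, y2))


def _center(bounds: str) -> Optional[Tuple[int, int]]:
    """Convert a bounds string into the midpoint of the rectangle."""
    match = BOUNDS_RE.fullmatch(bounds)
    if match is None:
        return None
    x1, y1, x2, y2 = map(int, match.groups())
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def find_tap_target(
    ui_xml: str, ocr_boxes: List[OcrBox], label: str
) -> Optional[Tuple[int, int, str]]:
    """Return (x, y, source): try the XML tree first, then fall back to OCR."""
    wanted = label.strip().lower()
    for node in ET.fromstring(ui_xml).iter("node"):
        text = node.get("text", "").strip().lower()
        if node.get("clickable") == "true" and text == wanted:
            point = _center(node.get("bounds", ""))
            if point is not None:
                return (point[0], point[1], "xml")
    # Ad banners and custom views often expose no text in the XML tree,
    # so fall back to text recognized from the screenshot.
    for text, (x1, y1, x2, y2) in ocr_boxes:
        if text.strip().lower() == wanted:
            return ((x1 + x2) // 2, (y1 + y2) // 2, "ocr")
    return None


def deeplink_intent(uri: str) -> Dict[str, str]:
    """Describe the VIEW intent that would reopen a recorded page directly."""
    return {"action": "android.intent.action.VIEW", "data": uri}


ui = (
    '<hierarchy>'
    '<node text="Search" clickable="true" bounds="[0,0][100,50]"/>'
    '<node text="" clickable="true" bounds="[0,100][200,300]"/>'
    '</hierarchy>'
)
ocr = [("Buy now", (20, 120, 180, 160))]
from_xml = find_tap_target(ui, ocr, "search")
from_ocr = find_tap_target(ui, ocr, "Buy now")
shortcut = deeplink_intent("exampleapp://editor/new")  # hypothetical URI
```

Here the "Search" button is found through the XML tree, while the unlabeled banner is only reachable through its OCR text. The deeplink helper just describes the intent as data; actually launching it on a device is outside the scope of this sketch.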

What the Oppo AI agent can actually do

Oppo shared several demonstrations of the agent at work. In one, the agent identifies a physical product via the camera, opens Taobao, scrolls through the results, and returns a price summary, with no typing required.

Oppo also demoed a floating on-screen companion that helps a user work through math exercises step by step: autonomously reading the screen, processing each question, and advancing when done.

In another example, a user asks the agent to assemble a highlight video from parrot-themed photos. The system scans the gallery, finds matching photos using its semantic memory, opens CapCut's video editor via deeplink, batch-selects the files, and generates the video. What used to take "a few minutes or longer" becomes a handful of automated steps.

2026: The year of agentic AI

AI agents have become one of the most discussed categories in tech. OpenClaw—the open-source agent framework that reached over 373,000 GitHub stars and was eventually backed by OpenAI—launched the current wave by showing what persistent, locally-run agents could do on PCs. Hermes Agent by Nous Research took things further with a self-improving learning loop that compounds capabilities over time.

Both run primarily on desktop hardware. X-OmniClaw extends the same architecture to the device you actually carry everywhere. The team built on the open-source HermesApp codebase, and the paper explicitly credits OpenClaw's structured skill model as foundational inspiration, which the team then adapted for the multimodal, always-on nature of a smartphone.

The code is on GitHub now. Oppo says it will release all assets and keep updating the project as the system evolves.
