Open Phone-Use Agents

PhoneBuddy: Training Open Models for Agentic Phone Use

PhoneBuddy studies how to train open phone-use agents with both real-app execution and scalable mock-app environments. The core result is simple: real-app RL provides realism, while PhoneWorld-style mock-app training adds a resettable, automatically verified interaction signal.

4B: open phone-use model line
150: real-phone human-evaluation tasks
83.2%: AndroidWorld success rate after Real+Mock RL
+5.0: average gain over real-app RL alone

Phone-Agent Research Line

PhoneBuddy sits in a broader phone-agent stack: environments for scalable interaction, training recipes for open models, runtime harnesses for real execution, and deployment boundaries for privacy and safety.

Phone-Agent Stack
Training: PhoneBuddy

Training open phone-use models with real-app RL and PhoneWorld-style mock-app training, showing that realism and scalable verified interaction are complementary.

Environment: PhoneWorld

A scalable pipeline that turns real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, verifiers, and training rollouts.

Runtime: PhoneHarness

A mixed-action phone-agent harness and benchmark that routes across CLI, GUI, and MCP tools with trace-backed verification.

Privacy: PhonePrivacy

A verifiable benchmark for privacy behavior in mobile agents, auditing permissioned access, minimal disclosure, and user-controlled memory.

Safety: PhoneSafety

Safety evaluation for phone-use agents, separating true safety behavior from simple incapability and checking risky mobile side effects.

Results

PhoneBuddy-4B-Real+Mock improves on the SFT and real-app RL checkpoints of the same open-model line in both real-phone and AndroidWorld settings. The visual summary below separates the main capability comparison from the cross-app limitation.

Radar chart values for PhoneBuddy-4B-Real+Mock (PB): AndroidWorld 83.2, Average 54.8, Mini-App 56.0, Cross-App 18.0, Single-App 62.0.

Capability Profile

The radar compares PhoneBuddy-4B-Real+Mock against GPT-5.4, Gemini 3.1 Pro, and Seed 2.0 across the same five axes. PhoneBuddy is strongest on AndroidWorld and single-app tasks, while cross-app transfer remains the visible weak axis.

Legend: PhoneBuddy, GPT-5.4, Gemini 3.1 Pro, Seed 2.0.

Average Success Rate

PhoneBuddy SFT: 42.6
GPT-5.4: 48.2
PhoneBuddy Real: 49.8
Seed 2.0: 51.4
PhoneBuddy Real+Mock: 54.8
Gemini 3.1 Pro: 59.1

Mixed real+mock training yields the strongest PhoneBuddy checkpoint: at 54.8 average it surpasses GPT-5.4 (48.2) and Seed 2.0 (51.4) and trails only Gemini 3.1 Pro (59.1) among the compared proprietary agents.

AndroidWorld

PhoneBuddy SFT: 60.3
GPT-5.4: 70.7
Seed 2.0: 71.5
PhoneBuddy Real: 77.2
Gemini 3.1 Pro: 80.2
PhoneBuddy Real+Mock: 83.2

AndroidWorld shows a clean progression from SFT (60.3) to real-app RL (77.2) to mixed real+mock RL (83.2).

Single-App Tasks

PhoneBuddy SFT: 34.0
Seed 2.0: 44.0
GPT-5.4: 50.0
Gemini 3.1 Pro: 50.0
PhoneBuddy Real: 54.0
PhoneBuddy Real+Mock: 62.0

GPT-5.4 and Gemini 3.1 Pro are shown separately; both score 50.0 on this slice.

WeChat Mini-App Tasks

GPT-5.4: 40.0
PhoneBuddy Real: 48.0
PhoneBuddy SFT: 54.0
PhoneBuddy Real+Mock: 56.0
Gemini 3.1 Pro: 58.0
Seed 2.0: 60.0

Real-only RL lowers mini-app success from 54.0 (SFT) to 48.0; adding mock-app practice recovers it to 56.0.

Model | Single-App | Cross-App | WeChat Mini-App | AndroidWorld | Avg.
PhoneBuddy-4B-SFT | 34.0 | 22.0 | 54.0 | 60.3 | 42.6
PhoneBuddy-4B-Real | 54.0 | 20.0 | 48.0 | 77.2 | 49.8
PhoneBuddy-4B-Real+Mock | 62.0 | 18.0 | 56.0 | 83.2 | 54.8

Figure: PhoneBuddy benchmark coverage.

Evaluation covers real-phone Single-App, Cross-App, and WeChat Mini-App tasks, plus AndroidWorld.

Figure: PhoneBuddy RL delta chart.

Delta view of real-app RL over SFT and mixed real+mock RL over real-app RL.
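The deltas can be recomputed directly from the per-slice numbers reported on this page. The sketch below is illustrative only: it assumes the Avg. column is an unweighted mean of the four slices (an assumption that happens to reproduce the reported 42.6, 49.8, and 54.8), and the names `SCORES`, `average`, and `delta` are hypothetical helpers, not part of any released code.

```python
# Per-slice success rates copied from the results table on this page.
SLICES = ("Single-App", "Cross-App", "WeChat Mini-App", "AndroidWorld")
SCORES = {
    "SFT": (34.0, 22.0, 54.0, 60.3),
    "Real": (54.0, 20.0, 48.0, 77.2),
    "Real+Mock": (62.0, 18.0, 56.0, 83.2),
}


def average(checkpoint):
    """Unweighted mean over the four slices (assumed definition of Avg.)."""
    values = SCORES[checkpoint]
    return round(sum(values) / len(values), 1)


def delta(after, before):
    """Per-slice and average change in success rate between two checkpoints."""
    result = {
        name: round(a - b, 1)
        for name, a, b in zip(SLICES, SCORES[after], SCORES[before])
    }
    result["Avg."] = round(average(after) - average(before), 1)
    return result


if __name__ == "__main__":
    print("Real vs. SFT:", delta("Real", "SFT"))
    print("Real+Mock vs. Real:", delta("Real+Mock", "Real"))
```

Under this assumption, the Real+Mock average gain over Real matches the +5.0 headline figure, while the Cross-App slice is the only one that decreases at both stages.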

Method

Figure: PhoneBuddy method overview.

PhoneBuddy compares a shared SFT checkpoint, a real-app RL checkpoint, and a mixed real+mock RL checkpoint under the same backbone, action interface, and evaluation protocol.

Real-App Environment

Real devices and authentic apps expose account state, app logic, timing variation, permission flows, and real side effects.

PhoneWorld

Runnable mock apps reconstructed from real GUI usage structure provide resettable training tasks and automatic verification.
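To make "resettable" and "automatic verification" concrete, here is a minimal toy sketch of a mock app. It is not PhoneWorld code: the class `MockNotesApp`, the action dictionaries, and `verify_note_saved` are all hypothetical. The point it illustrates is that a mock app keeps its state in inspectable fields, so an episode can be reset instantly and success can be checked from state rather than judged from screenshots.

```python
from dataclasses import dataclass, field


@dataclass
class MockNotesApp:
    """Toy stand-in for a mock app; all state lives in plain Python fields."""
    screen: str = "home"
    draft: str = ""
    notes: list = field(default_factory=list)

    def reset(self):
        """Return the app to a known initial state before each episode."""
        self.screen, self.draft, self.notes = "home", "", []
        return self.observe()

    def observe(self):
        """Observation handed to the agent (a real system would render a GUI)."""
        return {"screen": self.screen, "draft": self.draft,
                "note_count": len(self.notes)}

    def step(self, action):
        """Apply one hypothetical GUI action; invalid actions are ignored."""
        kind = action["type"]
        if kind == "tap" and action.get("target") == "new_note" and self.screen == "home":
            self.screen = "editor"
        elif kind == "type" and self.screen == "editor":
            self.draft += action["text"]
        elif kind == "tap" and action.get("target") == "save" and self.screen == "editor":
            self.notes.append(self.draft)
            self.screen, self.draft = "home", ""
        return self.observe()


def verify_note_saved(app, expected_text):
    """Automatic verifier: inspects app state instead of asking a human."""
    return expected_text in app.notes
```

A rollout would call `reset()`, feed the agent's actions to `step()`, and use the verifier's boolean result as the episode outcome.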

Real + Mock RL

The final branch keeps real execution in the loop while adding scalable mock-app interaction for broader and cheaper training signal.
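One simple way to keep real execution in the loop while adding mock-app interaction is to draw each rollout's environment from a fixed mixture. The sketch below assumes exactly that; the page does not state the actual mixing scheme, ratio, or RL algorithm, so `plan_batch` and `mock_ratio` are hypothetical names used purely for illustration.

```python
import random


def plan_batch(real_tasks, mock_tasks, batch_size, mock_ratio, seed=0):
    """Assign each rollout slot to a real-app or mock-app task.

    mock_ratio is an assumed hyperparameter: the probability that a slot
    is filled from the mock-app task pool rather than the real-app pool.
    """
    rng = random.Random(seed)
    plan = []
    for _ in range(batch_size):
        if rng.random() < mock_ratio:
            plan.append(("mock", rng.choice(mock_tasks)))
        else:
            plan.append(("real", rng.choice(real_tasks)))
    return plan


if __name__ == "__main__":
    # Task names here are placeholders, not tasks from the paper.
    real = ["real_task_a", "real_task_b"]
    mock = ["mock_task_a", "mock_task_b", "mock_task_c"]
    for source, task in plan_batch(real, mock, batch_size=8, mock_ratio=0.5):
        print(source, task)
```

Setting `mock_ratio` to 0 recovers a real-only batch, which corresponds in spirit to the real-app RL branch; intermediate values trade expensive real rollouts for cheaper, automatically verified mock rollouts.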

Real and Mock Environments

Figure: Real-app and mock-app environment comparison.

The real-app environment anchors training to authentic behavior; the mock-app environment contributes scale, reset, and automatic checking.

Qualitative Examples

Figure: PhoneBuddy success cases.

Representative successful trajectories show structured workflows where mock-app training improves execution reliability.

Limitation & Future Work: Cross-App Generalization

PhoneWorld-style mock tasks currently emphasize single-app interaction rather than cross-app handoff. The ablation therefore compares only PhoneBuddy checkpoints. Cross-app success falls from 22.0 after SFT to 18.0 after mixed real+mock RL, which suggests that mock-domain coverage matters for cross-app workflows.

Cross-App Tasks

PhoneBuddy SFT: 22.0
PhoneBuddy Real: 20.0
PhoneBuddy Real+Mock: 18.0