Open Phone-Use Agents

PhoneBuddy: Training Open Models for Agentic Phone Use

PhoneBuddy studies how to train open phone-use agents with both real-app execution and scalable mock-app environments. The core result is simple: real-app RL provides realism, while PhoneWorld-style mock-app training adds a resettable, automatically verified interaction signal.

Models: PhoneBuddy-4B (Real+Mock), PhoneBuddy-4B-RealApp (Real-only), PhoneBuddy-0.8B (Real+Mock)

4B open phone-use model line

150 real-phone human-evaluation tasks

83.2% AndroidWorld success rate after Real+Mock RL

+5.0 average gain over real-app RL alone

Phone-Agent Research Line

PhoneBuddy sits in a broader phone-agent stack: environments for scalable interaction, training recipes for open models, runtime harnesses for real execution, and deployment boundaries for privacy and safety.

Phone-Agent Stack

Training: PhoneBuddy

Training open phone-use models with real-app RL and PhoneWorld-style mock-app training, showing that realism and scalable verified interaction are complementary.

Environment: PhoneWorld

A scalable pipeline that turns real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, verifiers, and training rollouts.

Runtime: PhoneHarness

A mixed-action phone-agent harness and benchmark that routes across CLI, GUI, and MCP tools with trace-backed verification.

Privacy: PhonePrivacy

A verifiable benchmark for privacy behavior in mobile agents, auditing permissioned access, minimal disclosure, and user-controlled memory.

Safety: PhoneSafety

Safety evaluation for phone-use agents, separating true safety behavior from simple incapability and checking risky mobile side effects.

Results

PhoneBuddy-4B-Real+Mock improves on the SFT and real-only checkpoints of the same open-model line on Single-App, WeChat Mini-App, and AndroidWorld tasks; Cross-App is the exception. The summary below separates the main capability comparison from the cross-app limitation.

Capability Profile

The radar compares PhoneBuddy-4B-Real+Mock against GPT-5.4, Gemini 3.1 Pro, and Seed 2.0 across the same five axes. PhoneBuddy is strongest on AndroidWorld and single-app tasks, while cross-app transfer remains the visible weak axis.

Average Success Rate

Success rate (%): PhoneBuddy SFT: 42.6; GPT-5.4: 48.2; PhoneBuddy Real: 49.8; Seed 2.0: 51.4; PhoneBuddy Real+Mock: 54.8; Gemini 3.1 Pro: 59.1

Mixed real+mock training yields the strongest PhoneBuddy checkpoint. Its 54.8 average exceeds GPT-5.4 (48.2) and Seed 2.0 (51.4) and trails Gemini 3.1 Pro (59.1) by 4.3 points.

AndroidWorld

Success rate (%): PhoneBuddy SFT: 60.3; GPT-5.4: 70.7; Seed 2.0: 71.5; PhoneBuddy Real: 77.2; Gemini 3.1 Pro: 80.2; PhoneBuddy Real+Mock: 83.2

AndroidWorld shows a clean progression from SFT (60.3) to real-app RL (77.2) to mixed real+mock RL (83.2).

Single-App Tasks

Success rate (%): PhoneBuddy SFT: 34.0; Seed 2.0: 44.0; GPT-5.4: 50.0; Gemini 3.1 Pro: 50.0; PhoneBuddy Real: 54.0; PhoneBuddy Real+Mock: 62.0

GPT-5.4 and Gemini 3.1 Pro are shown separately; both score 50.0 on this slice.

WeChat Mini-App Tasks

Success rate (%): GPT-5.4: 40.0; PhoneBuddy Real: 48.0; PhoneBuddy SFT: 54.0; PhoneBuddy Real+Mock: 56.0; Gemini 3.1 Pro: 58.0; Seed 2.0: 60.0

Real-only RL lowers mini-app success from 54.0 (SFT) to 48.0; adding mock-app practice recovers it to 56.0.

Model	Single-App	Cross-App	WeChat Mini-App	AndroidWorld	Avg.
PhoneBuddy-4B-SFT	34.0	22.0	54.0	60.3	42.6
PhoneBuddy-4B-Real	54.0	20.0	48.0	77.2	49.8
PhoneBuddy-4B-Real+Mock	62.0	18.0	56.0	83.2	54.8

Evaluation covers real-phone Single-App, Cross-App, and WeChat Mini-App tasks, plus AndroidWorld. All values are success rates in percent.

Delta view of real-app RL over SFT and mixed real+mock RL over real-app RL.
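The averages and deltas can be recomputed from the three PhoneBuddy rows of the results table. The sketch below assumes that Avg. is the unweighted mean of the four evaluation columns; the page does not state this, but the assumption reproduces the reported values. The names `RESULTS`, `average`, and `delta` are illustrative only.

```python
# Success rates (%) copied from the PhoneBuddy rows of the results table.
COLUMNS = ["Single-App", "Cross-App", "WeChat Mini-App", "AndroidWorld"]
RESULTS = {
    "SFT": [34.0, 22.0, 54.0, 60.3],
    "Real": [54.0, 20.0, 48.0, 77.2],
    "Real+Mock": [62.0, 18.0, 56.0, 83.2],
}


def average(scores):
    """Unweighted mean over the four columns (an assumption, see lead-in)."""
    return sum(scores) / len(scores)


def delta(new, old):
    """Per-column difference in percentage points, rounded to one decimal."""
    return [round(n - o, 1) for n, o in zip(new, old)]


averages = {name: round(average(s), 1) for name, s in RESULTS.items()}
real_over_sft = dict(zip(COLUMNS, delta(RESULTS["Real"], RESULTS["SFT"])))
mock_over_real = dict(zip(COLUMNS, delta(RESULTS["Real+Mock"], RESULTS["Real"])))
avg_gain = round(averages["Real+Mock"] - averages["Real"], 1)

print(averages)
print("Real over SFT:", real_over_sft)
print("Real+Mock over Real:", mock_over_real)
print("Average gain of Real+Mock over Real:", avg_gain)
```

Under this assumption the mixed checkpoint gains on every column except Cross-App, where each RL stage loses 2.0 points.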

Method

PhoneBuddy compares a shared SFT checkpoint, a real-app RL checkpoint, and a mixed real+mock RL checkpoint under the same backbone, action interface, and evaluation protocol.

Real-App Environment

Real devices and authentic apps expose account state, app logic, timing variation, permission flows, and real side effects.

PhoneWorld

Runnable mock apps reconstructed from real GUI usage structure provide resettable training tasks and automatic verification.
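To make "resettable" and "automatic verification" concrete, here is a toy sketch of what a mock-app environment could look like. It is not PhoneWorld's actual interface: the `MockNotesApp` class, its action names, and the `verify_note_saved` checker are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MockNotesApp:
    """Hypothetical mock app: all state lives in memory, so reset is cheap."""

    notes: list = field(default_factory=list)
    draft: str = ""

    def reset(self):
        """Return to a known initial state so every episode starts identically."""
        self.notes = []
        self.draft = ""

    def step(self, action, text=""):
        """Apply one illustrative GUI action to the app state."""
        if action == "type":
            self.draft += text
        elif action == "tap_save":
            if self.draft:
                self.notes.append(self.draft)
                self.draft = ""
        else:
            raise ValueError(f"unknown action: {action}")


def verify_note_saved(app, expected):
    """Automatic verifier: reward 1.0 only if the expected note was saved."""
    return 1.0 if expected in app.notes else 0.0


app = MockNotesApp()
app.reset()
app.step("type", "buy milk")
app.step("tap_save")
reward = verify_note_saved(app, "buy milk")

# Resetting wipes the side effect, so the same task can be replayed.
app.reset()
reward_after_reset = verify_note_saved(app, "buy milk")
```

Because the verifier reads the final app state directly, no human judgment is needed to score a rollout, which is the property the mock-app branch relies on.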

Real + Mock RL

The final branch keeps real execution in the loop while adding scalable mock-app interaction for broader and cheaper training signal.
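One simple way to picture the mixed branch is a per-batch split between real-app and mock-app rollouts. The sketch below is an assumption made for illustration: the page does not describe the actual mixing scheme or ratio, and `sample_env_sources` and `mock_fraction` are hypothetical names.

```python
import random


def sample_env_sources(batch_size, mock_fraction, rng):
    """Assign each rollout in a batch to a 'real' or 'mock' environment.

    mock_fraction is a hypothetical knob; the source gives no ratio.
    """
    if not 0.0 <= mock_fraction <= 1.0:
        raise ValueError("mock_fraction must be in [0, 1]")
    n_mock = round(batch_size * mock_fraction)
    sources = ["mock"] * n_mock + ["real"] * (batch_size - n_mock)
    rng.shuffle(sources)  # interleave so neither source is processed in one run
    return sources


sources = sample_env_sources(batch_size=8, mock_fraction=0.75, rng=random.Random(0))
n_real = sources.count("real")
n_mock = sources.count("mock")
```

With any fraction below 1.0 and a large enough batch, some rollouts still execute on real apps, which matches the stated goal of keeping real execution in the loop while most of the interaction volume comes from the cheaper mock side.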

Real and Mock Environments

Real app and mock app environment comparison

The real-app environment anchors training to authentic behavior; the mock-app environment contributes scale, reset, and automatic checking.

Qualitative Examples

Representative successful trajectories show structured workflows where mock-app training improves execution reliability.

Limitation & Future Work: Cross-App Generalization

PhoneWorld-style mock tasks currently emphasize single-app interaction rather than cross-app handoff. The ablation therefore compares only PhoneBuddy checkpoints: Cross-App success falls from 22.0 (SFT) to 20.0 (Real) to 18.0 (Real+Mock), which suggests that mock-domain coverage matters for cross-app workflows.

Cross-App success rate (%): PhoneBuddy SFT: 22.0; PhoneBuddy Real: 20.0; PhoneBuddy Real+Mock: 18.0