arxiv:2605.24830

Macaron-A2UI: A Model for Generative UI in Personal Agents

Published on May 24 · Submitted by Andrew Chen on May 26

#3 Paper of the day

Mind Lab

Authors: Fancy Kong, Pony Ma

Abstract

AI-generated summary: Generative UI models enable personal agents to synthesize dynamic interfaces with lightweight executable actions for enhanced interaction beyond text-only formats.

As personal agents evolve to handle complex, user-centric tasks, static plain-text chat is rapidly becoming a bottleneck. Generative UI emerges as the necessary new interface layer, dynamically synthesizing the right controls, options, and state from the interaction context in real time. We present Macaron-A2UI, a model for Generative UI in personal agents. Our goal is to move beyond text-only interaction by enabling agents to generate natural language together with lightweight, executable UI actions for information collection, preference refinement, confirmation, and multi-goal organization. We build a large-scale Generative UI corpus from heterogeneous dialogue sources, introduce A2UI-Bench for controlled evaluation, and train 30B, 235B and 754B models with parameter-efficient LoRA-based supervised fine-tuning followed by reward-driven reinforcement learning. The best Macaron-A2UI model reaches 75.6 overall on A2UI-Bench without explicit schema hints, surpassing the strongest full-schema frontier baseline. We release the models, benchmark, and evaluation protocol to support future work on Generative UI for personal agents.
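To make the idea of "natural language together with lightweight, executable UI actions" concrete, here is a minimal sketch of what one agent turn and a basic validity check might look like. The JSON layout, the component names (`single_select`, `confirm`, and so on), and the helper `parse_turn` are all assumptions made for illustration; the abstract does not specify the actual A2UI schema.

```python
import json
from dataclasses import dataclass, field

# Hypothetical component catalog; NOT the paper's actual A2UI schema.
ALLOWED_COMPONENTS = {"single_select", "multi_select", "confirm", "text_input"}


@dataclass
class UIAction:
    component: str          # which control to render
    label: str              # text shown next to the control
    options: list = field(default_factory=list)  # choices for select controls


@dataclass
class AgentTurn:
    text: str               # the natural-language part of the reply
    actions: list           # UIAction items; empty means a plain-text (no-UI) turn


def parse_turn(raw: str) -> AgentTurn:
    """Parse a model output string and reject structurally invalid UI actions."""
    data = json.loads(raw)
    actions = [UIAction(**a) for a in data.get("actions", [])]
    for a in actions:
        if a.component not in ALLOWED_COMPONENTS:
            raise ValueError(f"unknown component: {a.component}")
        if a.component in {"single_select", "multi_select"} and len(a.options) < 2:
            raise ValueError("select components need at least two options")
    return AgentTurn(text=data["text"], actions=actions)


# Example: a preference-refinement turn with one select control.
raw = json.dumps({
    "text": "Which cuisine do you prefer tonight?",
    "actions": [
        {"component": "single_select", "label": "Cuisine",
         "options": ["Thai", "Italian", "Mexican"]}
    ],
})
turn = parse_turn(raw)
print(turn.text)
print([a.component for a in turn.actions])
```

A turn with no `actions` key parses to an empty action list, which corresponds to the plain-text case where no UI is needed.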

Community

anchen1011

Paper submitter 1 day ago

Macaron-A2UI: A Model for Generative UI in Personal Agents

zzy-hugging

1 day ago

Interesting work!

avahal

about 16 hours ago

the fact you can hit 75.6 on A2UI-Bench without explicit schema hints is pretty striking. that schema-light training recipe, with LoRA-SFT followed by reward-driven rl, basically lets the model learn to generate executable ui alongside natural language. i’d love to see an ablation where you cut the rl reward model entirely and rely only on supervised fine-tuning; my hunch is rl is doing most of the heavy lifting for action validity and safety. edge cases where controls differ across apps or safety policies kick in could expose brittleness in the generated widgets. btw, arxivlens had a solid breakdown that helped me parse the method details: https://arxivlens.com/PaperView/Details/macaron-a2ui-a-model-for-generative-ui-in-personal-agents-495-62505cf9 do you plan to publish an ablation on rl vs sft and test true cross-app robustness in a follow-up?

Fancylalala

Paper author about 3 hours ago

Thanks for the thoughtful comment! We include the SFT-only vs. SFT+RL ablation in Fig. 4 on the 30B and 235B models. SFT teaches the model the A2UI-specific syntax and basic protocol without overly long schema hints in the prompt. RL then further improves interaction quality according to the reward function design. This is also consistent with our reward analysis: low-level validity improves quickly under RL, while higher-level L2/L3 interaction quality improves more gradually.
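As a rough illustration of how a reward could separate low-level validity from higher-level interaction quality, here is a small sketch. Everything in it is an assumption: the function names, the weights, the gating rule, and the idea that the L2/L3 scores arrive as numbers in [0, 1] from some external judge. It is not the reward used in the paper.

```python
import json


def validity_score(raw: str) -> float:
    """Low-level check (assumed): output is a JSON object with a string
    'text' field and, if present, a list-valued 'actions' field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    ok = isinstance(data.get("text"), str) and isinstance(data.get("actions", []), list)
    return 1.0 if ok else 0.0


def tiered_reward(raw: str, l2_quality: float, l3_quality: float,
                  weights=(0.2, 0.4, 0.4)) -> float:
    """Combine validity with two higher-level quality scores (hypothetical).

    Invalid outputs get zero reward regardless of the quality scores, so the
    validity term is the easiest part of the reward to saturate.
    """
    valid = validity_score(raw)
    if valid == 0.0:
        return 0.0
    w1, w2, w3 = weights
    return w1 * valid + w2 * l2_quality + w3 * l3_quality


print(tiered_reward("not json", 1.0, 1.0))
print(tiered_reward('{"text": "hi", "actions": []}', 0.5, 0.5))
```

Under a gated design like this one, a policy collects the validity term as soon as its outputs parse, while the remaining reward depends on the harder-to-improve quality scores.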

We fully agree with your point about cross-app robustness and safety-policy edge cases. Our current benchmark already includes multiple domains, no-UI cases, depth/width tasks, and visual render checks, but truly cross-app settings where component catalogs, action policies, and safety constraints vary are an important next step. We see that as a natural follow-up direction, likely requiring app-conditioned component/action spaces and stress tests for policy-sensitive controls.
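One way to picture "app-conditioned component/action spaces" is a per-app profile that lists which controls exist and which actions are policy-sensitive. The sketch below is purely hypothetical: `AppProfile`, `check_actions`, the app names, and the rule that sensitive actions need a `confirm` control are illustrative assumptions, not part of the released benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppProfile:
    name: str
    components: frozenset          # controls this app can render
    needs_confirmation: frozenset  # actions that require an explicit confirm control


def check_actions(profile: AppProfile, actions: list) -> list:
    """Return a list of problems for the proposed actions under one app profile."""
    problems = []
    confirmed = any(a["component"] == "confirm" for a in actions)
    for a in actions:
        if a["component"] not in profile.components:
            problems.append(f"{a['component']} not available in {profile.name}")
        if a.get("action") in profile.needs_confirmation and not confirmed:
            problems.append(f"{a['action']} requires a confirm control")
    return problems


# Two hypothetical apps with different catalogs and policies.
banking = AppProfile("banking", frozenset({"single_select", "confirm"}),
                     frozenset({"transfer_funds"}))
notes = AppProfile("notes", frozenset({"text_input", "single_select"}), frozenset())

proposed = [{"component": "text_input", "action": "transfer_funds"}]
print(check_actions(banking, proposed))
print(check_actions(notes, proposed))
```

The same proposed action list passes under one profile and fails under the other, which is the kind of variation a cross-app stress test would need to cover.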