
Agent Team for Notion-Tier Onboarding Polish

A Conductor session spawned an agent team -- QA Tester, Evaluator, and Implementer -- to iteratively polish a product onboarding flow until it matched the quality bar of Notion. The team ran autonomously, filing issues, prioritizing, and fixing them in cycles.

Claude Code / claude-opus-4-6 · 15M tokens · 380 messages · ~5 hours · 34 files
QA Tester (finds issues) · Evaluator (prioritizes) · Implementer (fixes)

The brief was specific: make the onboarding feel Notion-tier. Not just functional — the kind of polish where every transition feels considered, every loading state is intentional, and the flow feels like it was built by designers who obsess over details. The structure of the flow was fixed. This was purely a quality pass. I set up three specialized agents in a continuous loop and let them run.

The team structure

The QA Tester’s job was to use the onboarding flow as a real user would, then document every moment that felt wrong — not just bugs, but anything that felt abrupt, generic, or unfinished. The Evaluator’s job was to receive that list and assign priority: P0 for things that felt broken (layout shifts, loading flashes, stuck states); P1 for things that felt unpolished (abrupt transitions, missing hover feedback, inconsistent weights); P2 for things that felt generic (default browser focus rings, standard loading spinners, no micro-animation on confirmation). The Implementer’s job was to fix the prioritized list, then hand back to QA.
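The three roles form a simple loop: QA files issues, the Evaluator prioritizes, the Implementer fixes, and the loop ends on a clean QA pass. A minimal TypeScript sketch of that loop — all names and types here are illustrative, not the actual Conductor harness:

```typescript
type Priority = "P0" | "P1" | "P2";

interface Issue {
  description: string;
  priority?: Priority;
}

// Runs QA -> Evaluator -> Implementer cycles until QA comes back clean.
// Returns the cycle number on which the clean pass happened.
function runPolishLoop(
  qa: () => Issue[],                      // uses the flow, files issues
  evaluate: (issues: Issue[]) => Issue[], // assigns P0/P1/P2
  implement: (issues: Issue[]) => void,   // fixes, highest priority first
  maxCycles = 10
): number {
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    const issues = qa();
    if (issues.length === 0) return cycle; // clean pass ends the loop
    implement(evaluate(issues));
  }
  return maxCycles; // safety cap so the loop can't run forever
}
```

The safety cap is a design choice worth keeping even with autonomous agents: an unbounded polish loop has no natural stopping point.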

Cycle by cycle

Cycle one: QA found twelve issues. Button hover states had no animation — they just changed color instantly. Input focus showed the default browser ring instead of a designed one. The progress bar jumped from step to step rather than animating. Loading states showed a full-page spinner that caused a significant layout shift when content loaded. Font weights were inconsistent between step two and step four headers. The reveal step at the end — where the user’s personalized setup was shown — had no moment of delight, just a static list appearing.

The Evaluator classified four of these as P0 (layout shift on load, progress bar jump, font weight inconsistency, the reveal step landing), five as P1 (button hover, input focus, step transitions), three as P2 (minor spacing inconsistencies). The Implementer worked through all P0s and P1s: added Framer Motion shared layout animations for step transitions, built a skeleton loading component shaped like the actual content to eliminate the layout shift, implemented a spring-physics progress bar that eased between step markers, added scale micro-animations on button press (0.97 → 1.0 with spring({ stiffness: 400, damping: 20 })), designed an animated input focus state with a border that grew from center, and added a confetti particle effect on the reveal step.
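The spring constants quoted above (stiffness 400, damping 20) give a damping ratio of 0.5 at unit mass, so the scale overshoots its target slightly before settling — that overshoot is what makes the press feel springy rather than eased. A minimal sketch of the physics, assuming unit mass and semi-implicit Euler integration (not Framer Motion's internals):

```typescript
// Integrates a damped spring from `from` toward `to` at 60fps and
// returns every frame's value until the motion settles.
function springTo(
  from: number,
  to: number,
  stiffness = 400,
  damping = 20,
  dt = 1 / 60
): number[] {
  let x = from;
  let v = 0;
  const frames: number[] = [x];
  // Settled = both displacement and velocity are negligible.
  while (Math.abs(to - x) > 1e-4 || Math.abs(v) > 1e-4) {
    const springForce = stiffness * (to - x); // pulls toward the target
    const dampingForce = -damping * v;        // resists motion
    v += (springForce + dampingForce) * dt;
    x += v * dt;
    frames.push(x);
  }
  return frames;
}
```

Calling `springTo(0.97, 1.0)` traces the button-press animation described above: the scale rises past 1.0 briefly, then settles.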

Cycle two: QA re-tested and found three new issues introduced by the fixes — the skeleton loading component had slightly wrong dimensions and was causing a small layout shift on resolution, the confetti was triggering on every visit to the reveal step including back-navigation, and the spring progress bar was overshooting on slow connections. The Evaluator flagged the first two as P0 regressions. The Implementer corrected the skeleton dimensions, added a hasConfettiFired ref to prevent confetti re-triggering, and tuned the spring constants to eliminate overshoot.
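The confetti fix is a once-only guard: the effect fires on the first arrival at the reveal step and is suppressed on any revisit. A minimal sketch as a plain closure, standing in for the React `hasConfettiFired` ref described above:

```typescript
// Wraps an effect so it fires at most once, no matter how many times
// the wrapped function is called (e.g. on back-navigation revisits).
function makeFireOnce(fire: () => void): () => void {
  let hasConfettiFired = false; // closure state; a useRef in the real fix
  return () => {
    if (hasConfettiFired) return; // revisits are a no-op
    hasConfettiFired = true;
    fire();
  };
}
```

A ref (or closure) is the right tool here rather than component state, since suppressing a repeat animation shouldn't trigger a re-render.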

Cycle three: QA found only two minor P2 issues. Both fixed. QA pass came back clean.

Cycle four was the reference comparison. The QA agent navigated both GAIA’s onboarding and Notion’s onboarding, took screenshots of equivalent states, and compared them systematically. Two gaps remained: the step transition ran at 400ms where Notion’s runs at 250ms, and the input field labels weren’t floating upward on focus the way Notion’s do; they stayed in place and just changed color. The Implementer shortened the transition duration and built a floating label animation. A final QA pass confirmed both matched the reference.
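A floating label is just two visual states — resting and floated — with an animated transition between them. A hypothetical helper sketching that state mapping (the offset and scale values are assumptions, and the 250ms duration is borrowed from the step-transition timing; the real fix was an animated label, not this function):

```typescript
interface LabelStyle {
  translateY: number;  // px offset; negative floats the label upward
  scale: number;       // shrinks the label in its floated position
  transitionMs: number;
}

// Maps the input's state to a label style. The label floats when the
// field is focused or already has a value, so it never overlaps text.
function labelStyle(focusedOrFilled: boolean): LabelStyle {
  return focusedOrFilled
    ? { translateY: -18, scale: 0.8, transitionMs: 250 } // floated
    : { translateY: 0, scale: 1, transitionMs: 250 };    // resting
}
```

Keying the float on "focused or filled" rather than focus alone is what keeps the label from dropping back over entered text on blur.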

The “Notion-tier” benchmark wasn’t aesthetic vagueness: the QA agent made it measurable by comparing directly against Notion’s flow and flagging specific timing and interaction differences. Four cycles, zero manual testing by me, and a final onboarding that feels like it was designed by someone who studies the apps they admire.
