The gstack workflow, from plan to retro

Garry Tan, CEO of Y Combinator, open-sourced his personal Claude Code setup. It’s not one skill. It’s eight, and they’re designed to work as a complete development workflow: plan a feature, review the plan, build it, review the code, ship it, QA it, run a retro.

We installed all eight and ran through a real feature build to see how they work in practice. Spoiler alert: it’s awesome. Here’s how it went.

Installing gstack

One command gets you everything:

npx skills add garrytan/gstack

Requires Node.js 18+.

This uses the skills CLI by Vercel, an open-source tool that installs agent skills to your project. It supports Claude Code, Cursor, Codex, and 40+ other editors, automatically placing files in the right location for each tool.

Two skills (browse and setup-browser-cookies) need an extra build step for the headless browser:

cd .agents/skills/gstack && bun install && bun run build && bunx playwright install chromium

You need Bun installed for this. The other six skills work immediately with no setup.

The workflow

gstack maps to a natural development cycle. Here’s the order we used:

  1. plan-ceo-review - Challenge the idea itself
  2. plan-eng-review - Lock in the technical approach
  3. (your agent writes code based on the plans)
  4. review - Pre-landing code review
  5. ship - Automated PR creation
  6. qa - Systematic testing
  7. retro - Weekly retrospective

You don’t have to use all of them, and they work fine individually. But the real value is in the full loop.

What we built

We used gstack to add an email newsletter signup form to the footer of this site (agent-config.com). Small feature, touches a few files, real enough to exercise each skill. We wrote a short plan document describing the goal and let gstack tear it apart.

Step 1: plan-ceo-review

This is the “should we even build this?” check. You give it a plan or feature idea, and it pressure-tests the concept with three modes:

  • Scope expansion: Dream bigger. What would make this 10x better?
  • Hold scope: The plan is accepted. Make it bulletproof.
  • Scope reduction: Cut everything that isn’t essential.

You pick the mode and the skill commits to it. It won’t silently drift toward a different posture mid-review.

What we saw: We ran it in HOLD SCOPE mode. The skill immediately caught that our plan referenced a component that didn’t exist. We’d written “move EmailCapture from the hero section to the footer,” but the component was actually called NewsletterSignup.astro, it wasn’t in the hero at all, and it wasn’t imported anywhere: an orphan component. That kind of error in a plan would have confused any agent trying to implement it.

The review produced a 10-section report covering architecture, error handling, security, data flow, performance, observability, deployment, and long-term trajectory. It mapped 10 error paths (including network failure, API timeout, duplicate email, rate limiting, and CORS errors), identified that none of them were handled in our plan, and flagged 8 specific corrections needed before implementation could start.

The completion summary table at the end is genuinely useful. Zero critical gaps, 6 diagrams produced, all decisions resolved. It reads like a real engineering review, not just a checklist.

Step 2: plan-eng-review

Once the concept is solid, this skill locks in the execution plan. It focuses on architecture, data flow, edge cases, and test coverage.

The first thing it does is a scope challenge: “What existing code already solves part of this? What’s the minimum change that achieves the goal?” Good starting questions that prevent over-engineering.

This skill is designed for interactive sessions. It presents issues one by one, gives you opinionated recommendations with concrete tradeoffs, and asks for your input before moving on. You pick from three review modes: SCOPE REDUCTION (the plan is overbuilt), BIG CHANGE (walk through each section interactively), or SMALL CHANGE (compressed single-pass review).

It produces ASCII diagrams for data flows and state machines, plus a test coverage map showing what needs unit tests, integration tests, and manual verification.

Step 3: Implement

This is where your AI agent builds the feature. The plans from steps 1 and 2 give it a detailed spec to implement against, and that’s the whole point of the planning skills: by the time you get here, the agent has a clear, reviewed plan with architecture diagrams, edge cases mapped, and test coverage defined.

In the gstack demo, this is where Claude Code takes the approved plan and writes the actual code.

We implemented the email capture based on the CEO review’s findings: inline form in the footer, honeypot bot protection, AbortController with 10-second timeout, distinct UI states for success/error/duplicate, and button disabling during submission.
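To make those requirements concrete, here’s a minimal sketch of the handler shape we ended up with. The endpoint path, field names, and state labels are our illustrative choices, not gstack’s output or the site’s actual code, and it includes the form-encoded body that the pre-landing review recommended:

```javascript
// Illustrative sketch: honeypot check, 10-second timeout via
// AbortController, and distinct result states. Endpoint and field
// names are assumptions, not the site's real code.
async function submitSignup(fields, fetchImpl = fetch) {
  // Honeypot: humans never see or fill this field; bots usually do.
  if (fields.company) return { state: "ignored" };

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 10_000);

  try {
    const res = await fetchImpl("/api/subscribe", {
      method: "POST",
      // Form-encoded body, per the pre-landing review's recommendation.
      body: new URLSearchParams({ email: fields.email }),
      signal: controller.signal,
    });
    if (res.status === 409) return { state: "duplicate" };
    return res.ok ? { state: "success" } : { state: "error" };
  } catch {
    return { state: "error" }; // network failure or aborted timeout
  } finally {
    clearTimeout(timer);
  }
}
```

The caller then maps each state to its UI message and re-enables the submit button once the promise settles.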

Step 4: review

The pre-landing review skill analyzes your branch’s diff against main. It reads a checklist file and checks for:

  • SQL safety issues
  • LLM trust boundary violations
  • Race conditions and conditional side effects
  • Other structural problems that tests don’t catch

What we saw: The review found 5 issues in our 3-file diff. Most were low severity or informational, but one was a real bug: we were sending Content-Type: application/json to an API endpoint that expects form-encoded data. That would have silently failed in production. The skill caught it and recommended switching to URLSearchParams.

It also flagged that our honeypot used class="hidden" (easy for bots to detect) and suggested inline positioning instead. And it noted our form had no action attribute for progressive enhancement when JavaScript is unavailable.

We fixed all three issues before moving on. The review paid for itself with the Content-Type catch alone.
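The encoding difference is easy to see in isolation (the email value here is illustrative):

```javascript
// The bug: a form-encoded endpoint can't parse a JSON body.
const fields = { email: "reader@example.com" };

const wrongBody = JSON.stringify(fields);
console.log(wrongBody); // {"email":"reader@example.com"}

// The fix the review recommended: URLSearchParams serializes to
// application/x-www-form-urlencoded.
const rightBody = new URLSearchParams(fields).toString();
console.log(rightBody); // email=reader%40example.com
```

A bonus of passing a URLSearchParams instance directly as a fetch body: fetch sets an application/x-www-form-urlencoded Content-Type header for you.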

Step 5: ship

This is the most automated skill. Type /ship and it runs straight through: merge main, run tests, bump the version, update the changelog, commit, push, and create the PR. It only stops for merge conflicts, test failures, or critical review findings.

What we saw: We ran it as a dry run. The skill checked our branch, counted 120 lines changed across 3 files, auto-selected a PATCH version bump, and generated a changelog entry. It would have created a single squashed commit with a conventional commit message plus a Co-Authored-By attribution.

One thing to know: ship expects VERSION and CHANGELOG.md files in your repo. If they don’t exist, it flags the issue and stops. You’d need to bootstrap those files before your first /ship.

The PR description it generates includes a summary, pre-landing review results, eval results (if any prompt-related files changed), and a test plan. Clean and standardized.

Step 6: qa

Four testing modes:

  • Diff-aware (default on feature branches): Analyzes your git diff, identifies affected pages, tests just those
  • Full: Systematic exploration of the entire app
  • Quick: 30-second smoke test
  • Regression: Compare against a saved baseline

This is where the browse binary comes in. The QA skill uses the headless browser to navigate your app, interact with elements, take screenshots, and produce a structured report with a health score.

The QA skill is particularly powerful for web apps: it checks responsive layouts, tests form submissions, handles dialogs, and asserts element states.
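gstack doesn’t publish the internals of diff-aware selection, but the idea can be sketched: map each changed file to the routes it can affect. This is a hypothetical reconstruction using Astro-style paths (since this site uses Astro), not the skill’s real logic:

```javascript
// Hypothetical: turn `git diff --name-only main` output into a list of
// routes worth testing. Not gstack's actual implementation.
function affectedRoutes(changedFiles) {
  const routes = new Set();
  for (const file of changedFiles) {
    // Astro convention: src/pages/foo.astro serves /foo.
    const page = file.match(/^src\/pages\/(.+)\.astro$/);
    if (page) {
      routes.add(page[1] === "index" ? "/" : `/${page[1]}`);
    } else if (file.startsWith("src/components/")) {
      // A shared component (like a footer signup form) can appear
      // anywhere, so fall back to the homepage as a key page.
      routes.add("/");
    }
  }
  return [...routes];
}

console.log(affectedRoutes([
  "src/components/NewsletterSignup.astro",
  "src/pages/about.astro",
])); // [ '/', '/about' ]
```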

Step 7: retro

Run /retro at the end of the week for a retrospective. It analyzes commit history, work patterns, and code quality metrics. If you’re on a team, it breaks down per-person contributions with praise and growth areas.

The retro supports multiple time windows (/retro 24h, /retro 14d, /retro 30d) and a comparison mode (/retro compare) that benchmarks the current period against the previous one. It saves JSON snapshots to .context/retros/ for trend tracking over time.

For solo developers, it’s a useful reflection tool. For team leads, the per-person breakdown with specific praise and growth areas could replace or supplement manual retro facilitation.

The browse and setup-browser-cookies skills

These two support the QA workflow. The browse skill is a fast headless Chromium driver (~100ms per command) that persists state between calls. The setup-browser-cookies skill imports logged-in sessions from your real browser so you can QA authenticated pages without logging in manually.

They’re the only skills that need compilation, but they’re worth the setup if you do any web development.

What works well

The planning skills are the standout. The plan-ceo-review skill caught an error in our plan that would have derailed implementation. That alone justified installing the stack. The structured review format (10 sections, completion summary, error registry) forces a thoroughness that most developers skip.

The review skill catches real bugs. Our Content-Type mismatch would have shipped to production and silently failed. The review also suggested honeypot and progressive enhancement improvements that made the final implementation more robust.

The ship skill standardizes releases. If you use version files and changelogs, the automated version bumping, changelog generation, and PR creation removes friction from the release process.

Skills are designed for interactive sessions. These work best when you invoke them inside Claude Code and have a conversation. The planning skills especially benefit from back-and-forth: the agent presents an issue, you weigh in, it adjusts.

What to watch for

Your repo needs VERSION and CHANGELOG.md for ship. The ship skill expects these files and will stop if they’re missing. Bootstrap them before your first /ship.

The browse binary requires Bun. If your team doesn’t use Bun, the setup step adds a dependency you might not want. The six non-browser skills work without it.

Interactive skills need interactive sessions. Running these through claude -p (non-interactive mode) works, but you lose the back-and-forth dialogue that makes the planning skills most valuable. Use them in a live Claude Code session for the best experience.

The skills are opinionated. They encode Garry Tan’s engineering preferences: DRY code, aggressive testing, ASCII diagrams, explicit over clever. If those preferences don’t match yours, the recommendations might feel off. But opinionated is better than generic.

Get gstack

The official repo: github.com/garrytan/gstack

Each skill is also available on agent-config.com with individual descriptions and install notes.

Install the full stack:

npx skills add garrytan/gstack

Requires Node.js 18+.