@htmlbin/cli — working notes

An honest pass at the problem before we declare it solved. What we're trying to enable, the shape of a candidate solution, where it works, where it breaks, and what's still an open call.

the problem

An engineer in an org with private repos.
Opens a PR with a frontend change.
Wants their reviewer to see the rendered result, not just the diff.
Doesn't want to send the link outside the company. Doesn't want to add another paid SaaS.
Wants this to happen automatically, every PR, gated to org members.

"PR-preview deploys" is a known shape — most modern hosting providers offer some flavor of it. But those are external services with their own data flows, their own billing, and their own ideas about who can see what. For some orgs that's fine. For others — regulated, security-conscious, or just "we already pay for this infra" — it isn't.

The narrow question we're trying to answer: can we get the same end-user experience using only the infrastructure the org already has?

the user journey we'd want to enable

Concretely, what the engineer should experience:

Pushes commits to a branch, opens a PR — exactly as they do today.
Within a minute, a sticky comment appears on the PR: "🔍 Preview: <url>"
Clicks the URL. Browser gets redirected through the org's SSO if not already signed in.
Lands on the rendered page. Reviews. Comments on the PR. Approves.
Pushes another commit; the URL updates. PR closes; the URL goes away.

Nothing about that journey is new — every modern hosting provider does some version of it. The interesting question is the plumbing underneath, and which pieces we can fit together from what an org already runs.
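
One piece of that plumbing is small enough to sketch here: keeping the preview comment sticky, so each push edits one comment instead of adding another. A minimal sketch, assuming a hidden HTML marker identifies our comment. The marker string and every name below are hypothetical, not taken from the shipped workflow.

```typescript
// Hypothetical marker; any string unlikely to appear in a human comment works.
const MARKER = "<!-- htmlbin-preview -->";

interface PrComment {
  id: number;
  body: string;
}

type CommentAction =
  | { kind: "create"; body: string }
  | { kind: "update"; id: number; body: string };

// Decide whether to create the preview comment or edit the one already there.
function planStickyComment(existing: PrComment[], previewUrl: string): CommentAction {
  const body = `${MARKER}\n🔍 Preview: ${previewUrl}`;
  const previous = existing.filter((c) => c.body.indexOf(MARKER) !== -1)[0];
  if (previous === undefined) return { kind: "create", body };
  return { kind: "update", id: previous.id, body };
}
```

The caller would turn the returned action into the matching create or edit call on the PR's comments. The same marker lookup finds the comment again when the PR closes.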

candidate solution

A CLI that runs once per PR push, inside CI, and publishes an HTML file to a destination the org already controls. SSO gating belongs to that destination; the CLI doesn't authenticate viewers. For the two org-private destinations, the viewer's browser talks directly to the destination and the org's IdP, and htmlbin.dev is not in the request path.

One CLI, three possible destinations chosen to match three organizational contexts:

cloud — htmlbin.dev. Public URLs. The default, but irrelevant for the org-private use case.
gh-pages — the org's GitHub Pages on a private repo with Pages → "Private."
cloudflare — Cloudflare Pages behind Cloudflare Access. Relevant for orgs without GitHub Enterprise.

Same CLI verb. Same single-file contract. Different destinations for different contexts. "Destination" is the abstract term throughout (the code calls the same thing a backend); call it the hosting provider when context matters.

how it would actually work — the gh-pages path

End-to-end on every PR push:

ENGINEER         GITHUB.COM            CI RUNNER             HOSTING              IdP
   │                  │                    │                    │                  │
   │ git push (PR)    │                    │                    │                  │
   │─────────────────>│                    │                    │                  │
   │                  │ trigger workflow   │                    │                  │
   │                  │───────────────────>│                    │                  │
   │                  │                    │                    │                  │
   │                  │                    │ npm ci && npm run build                 │
   │                  │                    │   → ./dist/index.html (?)             │
   │                  │                    │                    │                  │
   │                  │                    │ htmlbin publish ./dist/index.html --to gh-pages
   │                  │ Git Data API       │                    │                  │
   │                  │ atomic commit ←────│                    │                  │
   │                  │ to gh-pages/pr-N/  │                    │                  │
   │                  │                    │                    │                  │
   │                  │ Pages rebuild ──────────────────────────>│                  │
   │                  │                    │                    │                  │
   │                  │ sticky comment     │                    │                  │
   │                  │<───────────────────│                    │                  │
   │                  │                    │                    │                  │
   │ click URL        │                    │                    │                  │
   │──────────────────────────────────────────────────────────>│                  │
   │                                                            │ 302 → SSO ─────>│
   │ sign in ───────────────────────────────────────────────────────────────────>│
   │<─────────── session cookie ──────────────────────────────────────────────────│
   │ retry ────────────────────────────────────────────────────>│                  │
   │<───────────── 200 OK, HTML ───────────────────────────────│                  │

Three real commands run in the runner. No magic:

- run: |
    npm ci
    npm run build               # produces HTML — IF the repo's build does that
    npx @htmlbin/cli publish ./dist/index.html --to gh-pages
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
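
What the publish line might do against GitHub, sketched as a plan of REST calls. The diagram only says "Git Data API, atomic commit to gh-pages/pr-N/", so the exact sequence, the index.html file name, and the function names here are assumptions. The routes themselves are the standard Git Data endpoints.

```typescript
interface ApiStep {
  method: "GET" | "POST" | "PATCH";
  path: string;
  purpose: string;
}

// Where the preview lands inside the gh-pages branch (file name is an assumption).
function previewPath(prNumber: number): string {
  return `pr-${prNumber}/index.html`;
}

// Ordered REST calls for one commit onto gh-pages. {sha} is filled in at run time.
function planAtomicCommit(owner: string, repo: string, prNumber: number): ApiStep[] {
  const base = `/repos/${owner}/${repo}/git`;
  const file = previewPath(prNumber);
  return [
    { method: "GET", path: `${base}/ref/heads/gh-pages`, purpose: "read the current branch tip" },
    { method: "GET", path: `${base}/commits/{sha}`, purpose: "find the tip commit's tree" },
    { method: "POST", path: `${base}/blobs`, purpose: "upload the HTML as a blob" },
    { method: "POST", path: `${base}/trees`, purpose: `new tree with the blob at ${file}, on top of the tip's tree` },
    { method: "POST", path: `${base}/commits`, purpose: "commit the new tree with the tip as parent" },
    { method: "PATCH", path: `${base}/refs/heads/gh-pages`, purpose: "move the branch to the new commit" },
  ];
}
```

The branch only moves at the final PATCH, which is what makes the commit atomic from a reader's point of view: Pages sees either the old tree or the new one, never a half-written directory.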

where it works cleanly — and where it doesn't

Three concrete user journeys map onto today's reality. One is the original problem; the other two are the rest of the world.

Works today

Static-site repos

Storybook · SPA framework builds · static-site generators. The build naturally emits HTML at ./dist/index.html or similar. The workflow publishes whatever the build produces.

~10 lines of YAML, no thinking required.

Works, narrow

Agent-driven local

Human at a terminal. Coding agent generates HTML to a file. htmlbin publish ships it to the cloud product. Existing flow; the CLI is just a friendlier curl.

Not the org-private journey — public URLs.

Gap — to solve

Non-static repos

Backend services · libraries · mobile apps · CLIs. No natural HTML artifact. The workflow as written has nothing to publish.

This is probably the majority of repos. The generation question (next section) is for this case.

the generation problem

For non-static repos, "the preview" doesn't pre-exist anywhere. Something has to look at the PR's changes and produce HTML that represents them. The CLI's contract is publish-only, so generation needs to land upstream of the publish step — and the generator has to be more than a single API call. The thing that makes today's coding agents useful is the harness: file access, shell execution, multi-turn refinement, validation. A raw model call against a diff gives you noise; an agent loop gives you a real artifact.

We are not going to build a generation engine. The agent-harness pattern is well-established across the coding-tools landscape; multiple existing systems already do PR-aware code work with real tool use — file system access, bash execution, multi-turn refinement. The pragmatic move is to leverage one of them.

Decision: Option 1. The agent runs in the CI runner; htmlbin-cli stays publish-only and consumes whatever file the agent writes. The other two options stay as alternatives we might revisit, but neither ships now.

scope · small · chosen

option 1 · going with this

Run a coding agent in your CI runner

The CI runner becomes the agent's sandbox. A headless agent CLI (e.g. claude -p, codex exec) runs inside the runner with full tool use: Read / Write / Edit on the checked-out repo, Bash to run the build, multi-turn loop with reflection. The prompt: "produce an HTML representation of this PR's changes at ./preview.html." Then htmlbin publish ./preview.html --to gh-pages ships the file. htmlbin-cli's contract is unchanged — it still takes HTML as input.

This is not "an API call with a prompt." It's the same agent loop you'd run interactively, in non-interactive mode. The harness is what makes the output good — single-shot text generation against a diff produces noise; an agent that can read related files, run the build, and iterate produces a real preview.

Why this one: repo content never leaves the CI environment (only model-API egress); vendor-neutral (any headless coding-agent CLI works); zero new htmlbin surface area; the prompt is config the team can iterate on without touching us.

What ships: a reference workflow at cli/examples/agent-preview-workflow.yml showing the pattern with Claude Code, with comments calling out the swap-points for other agent CLIs.

Plays well with: Claude Code -p mode · OpenAI Codex CLI (codex exec) · any other agent CLI that supports headless / non-interactive output
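
The swap-point between agent CLIs is narrow enough to show in a few lines. A sketch, not the reference workflow itself: it only builds the two command lines named above and leaves out what a real CI step would also need (API key env vars, tool-permission flags). The function names are hypothetical.

```typescript
type AgentCli = "claude" | "codex";

const PREVIEW_PROMPT =
  "Produce an HTML representation of this PR's changes at ./preview.html.";

// argv for the headless agent run; the only part that differs per vendor.
function agentArgv(agent: AgentCli, prompt: string): string[] {
  switch (agent) {
    case "claude":
      return ["claude", "-p", prompt];
    case "codex":
      return ["codex", "exec", prompt];
  }
}

// The publish step is identical regardless of which agent wrote the file.
function publishArgv(file: string): string[] {
  return ["npx", "@htmlbin/cli", "publish", file, "--to", "gh-pages"];
}
```

Everything after the agent step is unchanged, which is the point of keeping htmlbin-cli publish-only.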

scope · medium · alternative

option 2

Trigger a remote agent in the vendor's cloud

Same agent harness pattern, but the harness runs on the vendor's infra rather than in your CI. A workflow webhook triggers the vendor on PR open; the vendor's sandbox checks out the repo, runs the agent loop, commits the HTML output back to a preview branch; htmlbin publish ships it.

Why not now: the repo content becomes visible to the vendor's sandbox (matters for sensitive code), and each vendor has its own trigger model and billing surface. Orgs already paying for one of these vendors are mostly covered by Option 1 anyway, because the vendor's CLI typically runs inside Option 1's CI step too.

Plays well with: GitHub Copilot Coding Agent · Cursor background agents · Devin · vendors with hosted agent products

scope · large · alternative

option 3

htmlbin grows a `compose` command that wraps Option 1

The CLI would add htmlbin compose --pr 1234 --output ./preview.html as sugar over Option 1: shell out to one specific agent harness with a canonical prompt.

Why not now: we'd own the prompt (a real product) and pick a default agent (a real bet) before we know what "good preview HTML for a PR" actually looks like. Right path only after a few teams have run Option 1 and converged on a prompt shape. Revisit once we have real data.

the risks we know about

Risk · The SSO gate has a paid floor

"Pages → Private" requires GitHub Enterprise Cloud (~$21/user/month). On Pro and Team plans a Pages site is public even when it is published from a private repo, and on Free plans Pages isn't available for private repos at all. The gh-pages destination therefore gates nothing for the orgs that need it most. Cloudflare Access fills that hole, free up to 50 users, but adds setup friction.

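The trade-off in this risk reduces to a small decision. A sketch, assuming the caller already knows whether the org can flip Pages to private and roughly how many people need to view previews; the type and function names are hypothetical, and "paid plan" beyond 50 users is our reading of the free-tier limit rather than a checked price.

```typescript
type GatedChoice = {
  destination: "gh-pages" | "cloudflare";
  note: string;
};

const CLOUDFLARE_FREE_USERS = 50; // free-tier limit quoted in the notes

function chooseGatedDestination(privatePagesAvailable: boolean, viewers: number): GatedChoice {
  if (privatePagesAvailable) {
    return { destination: "gh-pages", note: "Pages set to Private gates the site" };
  }
  if (viewers <= CLOUDFLARE_FREE_USERS) {
    return { destination: "cloudflare", note: "fits the Cloudflare Access free tier" };
  }
  return { destination: "cloudflare", note: "needs a paid Cloudflare Access plan" };
}
```
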
Risk · Generation quality is the actual product risk

Once we go past static-site repos, the preview is only as good as the agent that generates it. "Render an HTML representation of the diff" is a vague prompt that will produce inconsistent output. Whether reviewers actually find the preview useful — versus reading the diff directly — is an empirical question we haven't tested.

Risk · Hosting-provider rebuild lag

~60 seconds between commit and the URL being live on GitHub Pages (Cloudflare Pages is faster but still not instant). The sticky comment posts immediately, so reviewers might click during the gap and see a 404. Recoverable with a refresh but unpleasant.
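
One possible mitigation, not something these notes commit to: hold the sticky comment until GitHub reports the Pages build for our commit as done. The sketch below is only the decision step. It assumes the workflow polls the REST route for the latest Pages build (GET /repos/{owner}/{repo}/pages/builds/latest) and that the status strings are "built" and "errored" as we recall them; verify both against the API docs. Probing the preview URL itself is unlikely to work from CI, since a gated site would answer with a sign-in redirect either way.

```typescript
interface LatestBuild {
  status: string | null; // e.g. "queued", "building", "built", "errored"
  commit: string;        // SHA the build was made from
}

type NextStep = "post-preview" | "post-build-failed" | "wait" | "post-with-lag-warning";

function nextStep(
  build: LatestBuild,
  expectedCommit: string,
  elapsedMs: number,
  budgetMs: number
): NextStep {
  // A finished build for an older commit says nothing about ours yet.
  const isOurs = build.commit === expectedCommit;
  if (isOurs && build.status === "built") return "post-preview";
  if (isOurs && build.status === "errored") return "post-build-failed";
  return elapsedMs < budgetMs ? "wait" : "post-with-lag-warning";
}
```

When the budget runs out, posting the URL with a "may take a minute" note is still better than posting nothing.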

Risk · Three different tokens, three different auth models

hb_* for cloud · GitHub PAT for gh-pages · Cloudflare API token for the cloudflare backend. Error messages name the destination that failed, but the documentation overhead is real and the configuration story is muddier than "log in once."
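
A sketch of what "error messages name the destination that failed" can look like at the token-lookup step. Apart from GITHUB_TOKEN, which the workflow step earlier in these notes sets, the env var names are assumptions, as are the function and type names.

```typescript
type Destination = "cloud" | "gh-pages" | "cloudflare";
type Env = { [name: string]: string | undefined };

// Assumed env var per destination; only GITHUB_TOKEN appears in the notes.
const TOKEN_ENV: Record<Destination, string> = {
  "cloud": "HTMLBIN_TOKEN",
  "gh-pages": "GITHUB_TOKEN",
  "cloudflare": "CLOUDFLARE_API_TOKEN",
};

type TokenResult =
  | { ok: true; token: string }
  | { ok: false; message: string };

function resolveToken(destination: Destination, env: Env): TokenResult {
  const name = TOKEN_ENV[destination];
  const token = env[name];
  if (!token) {
    return { ok: false, message: `${destination}: missing token, set ${name}` };
  }
  // Cloud tokens carry the hb_ prefix, so a wrong-destination token is caught early.
  if (destination === "cloud" && token.indexOf("hb_") !== 0) {
    return { ok: false, message: `cloud: token in ${name} should start with hb_` };
  }
  return { ok: true, token };
}
```

Taking the environment as a parameter keeps the lookup testable and makes a later unified config a drop-in replacement for the env map.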

Risk · gh-pages and cloudflare are unverified end-to-end

Unit tests pass for both. Cloud backend has been exercised against production. Neither alt destination has been exercised against a real sandbox yet. The Octokit git.getRef call's URL-encoding of heads/gh-pages looked suspicious in one test; the 404 from a non-existent repo was ambiguous. We don't know if there's a bug there.
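
To make the suspicion concrete: the ref can reach the wire with or without its slash percent-encoded, and a 404 alone doesn't say which problem we hit. The sketch below shows both path forms and one way to disambiguate by pairing the getRef status with a plain repo lookup. Which form Octokit actually sends is the open question, so nothing here asserts it; the function names are hypothetical.

```typescript
// Two ways the same ref can end up in the request path.
function refPaths(owner: string, repo: string, ref: string): { literal: string; encoded: string } {
  const base = `/repos/${owner}/${repo}/git/ref/`;
  return { literal: base + ref, encoded: base + encodeURIComponent(ref) };
}

type RefDiagnosis =
  | "ok"
  | "repo-missing-or-no-access"
  | "branch-missing-or-bad-encoding"
  | "unexpected";

// repoStatus: GET /repos/{owner}/{repo}; refStatus: the getRef call.
function diagnose(repoStatus: number, refStatus: number): RefDiagnosis {
  if (repoStatus === 404) return "repo-missing-or-no-access";
  if (repoStatus === 200 && refStatus === 200) return "ok";
  if (repoStatus === 200 && refStatus === 404) return "branch-missing-or-bad-encoding";
  return "unexpected";
}
```

Running this against a repo where gh-pages is known to exist separates the last ambiguity: a "branch-missing-or-bad-encoding" result there points at the encoding.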

Risk · Cloudflare's setup curve is steeper than gh-pages

Sign up for Zero Trust, get an API token with the right scopes, find the account ID, run setup with IdP/email flags. More steps than "flip Pages → Private." The free-tier 50-user limit is real and not obvious.

what we're deliberately not solving

Out · Running the user's app server in our infra

For dynamic apps, the most accurate "preview" is the running app with the PR's code applied. That requires per-PR runtime — Lambdas, containers, edge functions, sandboxes. A real product, but a different one from htmlbin. Not in scope.

Out · A built-in model or generation prompt that we maintain

Even if we ship Option 3 (an in-CLI compose command), the prompt and provider stay user-facing and configurable. We're not in the business of operating a model or guaranteeing output quality.

Out · Multi-file drops in v1

One HTML file in, one URL out. Assets must be inline or CDN-hosted. Phase 2 if there's demand.

Out · Destinations past v1

Other hosting providers (Vercel · Netlify · GitLab Pages · S3 + Cognito · plain filesystem) are all defined as Phase 2 entries on the same backend interface. None ship in v1. The interface is small enough that adding one is a single file.
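
To give "adding one is a single file" some shape, here is a sketch of a destination written against a four-method interface. The notes only say the interface has four methods, so the method names and signatures are hypothetical. It is synchronous and in-memory to stay short; a real destination would do I/O and return promises.

```typescript
// Hypothetical shape of the four methods.
interface Backend {
  publish(slug: string, html: string): string; // returns the URL
  remove(slug: string): boolean;               // true if something was removed
  exists(slug: string): boolean;
  urlFor(slug: string): string;
}

// An in-memory stand-in for a Phase 2 destination such as the plain filesystem.
function memoryBackend(baseUrl: string): Backend {
  const pages: { [slug: string]: string } = {};
  const has = (slug: string): boolean => Object.prototype.hasOwnProperty.call(pages, slug);
  const urlFor = (slug: string): string => `${baseUrl}/${slug}/`;
  return {
    urlFor,
    exists: has,
    publish(slug, html) {
      pages[slug] = html; // overwrite in place, as the gh-pages destination does
      return urlFor(slug);
    },
    remove(slug) {
      const had = has(slug);
      delete pages[slug];
      return had;
    },
  };
}
```

The one-file claim holds as long as a new destination needs nothing beyond these calls: the CLI verb and the single-file contract stay where they are.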

Out · Versioning UX

The cloud destination gets versioning from the Worker. gh-pages overwrites (PR's git history is the log). Cloudflare keeps every deployment but only one alias per slug. No unified version-pinning command in the CLI; we'll add it if users actually ask.

what's actually in tree

v1 of the CLI lives on a working branch. Concrete state:

Surface	State
4-method `Backend` interface + three backends	typecheck clean
Unit tests (repo parsing, config resolution, error mapping, cloud destination via mocked HTTP)	50/50 pass
Drop-in CI workflow with three branching sticky comments (preview / no HTML / build failed)	copy-ready
Cloud destination	exercised against production end-to-end
gh-pages destination	unverified end-to-end
cloudflare destination	unverified end-to-end
Generation — Option 1 reference workflow (Claude Code in CI → htmlbin publish)	shipping with this update · prompt untested on real PRs

CLI ergonomics — with attribution

The patterns that make a CLI feel native to coding agents aren't ours to invent. DataDog's pup CLI and its public design notes — along with the Speakeasy team's writeup on engineering an agent-friendly CLI — lay this out clearly enough that the right move is to copy what works. A few conventions we adopted directly:

Auto-detect the runner. When a known coding-agent env signature is present (Claude Code, Cursor, Codex, Aider, Cline, and several others), the CLI defaults to machine-readable output. The agent doesn't have to know about the flag. A manual override covers runners we haven't named.
One error shape, end to end. The CLI's machine errors mirror the API's error.code contract — same keys agents already parse on the wire. No second vocabulary to document.
Richer User-Agent. Every outbound call carries CLI version, runtime version, OS / architecture, and the detected agent name — visible to operators server-side without changing the API shape.
Stable, categorized exit codes — already in tree; pup's doc validated that the categories we picked (auth, not-found, rate-limit, size, input, network) are the conventional ones.
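
A sketch of how three of these conventions fit together. The six category names come from the list above; the numeric exit codes, the HTTP-status mapping, the error object beyond its error.code key, and the User-Agent layout are all assumptions made for illustration.

```typescript
type ErrorCategory = "auth" | "not-found" | "rate-limit" | "size" | "input" | "network";

// Placeholder numbers; only the category names come from the notes.
const EXIT_CODE: Record<ErrorCategory, number> = {
  "auth": 2, "not-found": 3, "rate-limit": 4, "size": 5, "input": 6, "network": 7,
};

// status === null means no HTTP response arrived at all.
function categorize(status: number | null): ErrorCategory {
  if (status === null) return "network";
  if (status === 401 || status === 403) return "auth";
  if (status === 404) return "not-found";
  if (status === 429) return "rate-limit";
  if (status === 413) return "size";
  if (status >= 400 && status < 500) return "input";
  return "network"; // 5xx lumped in with transport failures, a judgment call
}

// Same error.code key on the CLI's machine output as on the wire.
function machineError(status: number | null, message: string) {
  const code = categorize(status);
  return { error: { code, message }, exitCode: EXIT_CODE[code] };
}

function userAgent(cliVersion: string, nodeVersion: string, os: string, arch: string, agent: string | null): string {
  const agentPart = agent === null ? "" : `; agent=${agent}`;
  return `htmlbin-cli/${cliVersion} (node ${nodeVersion}; ${os} ${arch}${agentPart})`;
}
```

Keeping the category as the single source for both the JSON code and the exit code means an agent can branch on either without a lookup table of its own.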

Three pup patterns we deliberately didn't take, with reasons:

OS-keychain token storage. Conflicts with the file-based storage the agent protocol descriptor at /api/onboard advertises, and agents typically can't reach the keychain. Revisit if humans complain about plaintext-at-rest.
List response envelope (count / truncation / warnings around a data field). Breaking shape change to our current JSON array — worth doing alongside real pagination, not standalone.
Verbose API tracing flag. Useful once we're running the non-cloud destinations end-to-end against sandboxes. Not now.

Credit where due: github.com/DataDog/pup and Speakeasy's engineering-agent-friendly-cli post. Both are good reads for anyone shipping a CLI that agents will use.

open questions

What does "good preview HTML for a PR" actually look like? Until we try it on a real PR with the Option 1 workflow, we're guessing. The prompt in the reference workflow is a starting point; teams will tune it for their change types. The empirical answer drives whether Option 3 (htmlbin owning a prompt) ever makes sense.
Which agent CLI do we lead the reference workflow with? Claude Code (claude -p) is the chosen example because its syntax is the most concise. Codex CLI (codex exec) is documented as a swap-point. We're not endorsing one over the other.
Is the Octokit git.getRef URL-encoding a real bug? A 30-minute test against a real private repo answers it.
Does the three-token UX hold up? Or do we eventually need a per-destination htmlbin login with a unified config?
Is "three destinations" the right framing, or does it dilute the product? Cloud + one alt might be cleaner than cloud + two alts that overlap.

what to learn next

Two experiments answer most of what's still uncertain:

End-to-end gh-pages flow against a real private repo on GitHub Enterprise Cloud. Confirm that the SSO redirect works, that the Octokit URL encoding isn't actually broken, and that the rebuild lag is tolerable.
Run the Option 1 reference workflow on a real non-static repo PR. Use the shipped agent-preview-workflow.yml, swap in an Anthropic API key (or Codex equivalent), and see what HTML the agent actually produces. Iterate on the prompt; record what kinds of PRs the pattern handles well vs. where it produces noise. That data tells us whether Option 3 (sugar layer) is worth building later, and whether to ship a curated set of prompts for common change types.

The code in tree can stay either way. Those two experiments tell us whether to keep building, change scope, or pick a different direction entirely.