@htmlbin/cli — working notes

An honest pass at the problem before we declare it solved. What we're trying to enable, the shape of a candidate solution, where it works, where it breaks, and what's still an open call.

the problem

An engineer in an org with private repos.
Opens a PR with a frontend change.
Wants their reviewer to see the rendered result, not just the diff.
Doesn't want to send the link outside the company. Doesn't want to add another paid SaaS.
Wants this to happen automatically, every PR, gated to org members.

"PR-preview deploys" is a known shape — most modern hosting providers offer some flavor of it. But those are external services with their own data flows, their own billing, and their own ideas about who can see what. For some orgs that's fine. For others — regulated, security-conscious, or just "we already pay for this infra" — it isn't.

The narrow question we're trying to answer: can we get the same end-user experience using only the infrastructure the org already has?

the user journey we'd want to enable

Concretely, what the engineer should experience:

  1. Pushes commits to a branch, opens a PR — exactly as they do today.
  2. Within a minute, a sticky comment appears on the PR: "🔍 Preview: <url>"
  3. Clicks the URL. Browser gets redirected through the org's SSO if not already signed in.
  4. Lands on the rendered page. Reviews. Comments on the PR. Approves.
  5. Pushes another commit; the URL updates. PR closes; the URL goes away.
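
One way the "sticky" behavior in steps 2 and 5 could work is to tag the comment with a hidden marker and edit it in place on later pushes. A minimal sketch, assuming a marker string and function names that are made up here (these notes don't say how the reference workflow does it):

```typescript
// Hidden marker identifying "our" comment among all PR comments.
// The marker text is an assumption for this sketch.
const MARKER = "<!-- htmlbin-preview -->";

interface PrComment {
  id: number;
  body: string;
}

type CommentAction =
  | { kind: "create"; body: string }
  | { kind: "update"; id: number; body: string };

// Decide whether to create a new comment or edit the existing sticky one.
function planStickyComment(existing: PrComment[], previewUrl: string): CommentAction {
  const body = MARKER + "\n🔍 Preview: " + previewUrl;
  for (const comment of existing) {
    if (comment.body.indexOf(MARKER) >= 0) {
      return { kind: "update", id: comment.id, body: body };
    }
  }
  return { kind: "create", body: body };
}
```

The caller would turn the returned action into the matching "create comment" or "update comment" API call, so a PR only ever carries one preview comment.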

Nothing about that journey is new: most modern hosting providers do some version of it. The interesting question is the plumbing underneath, and which pieces we can fit together from what an org already runs.

candidate solution

A CLI that runs once per PR push, inside CI, and publishes an HTML file to a destination the org already controls. SSO gating belongs to that destination — the CLI doesn't authenticate viewers. The viewer's browser talks directly to the destination and the org's IdP; htmlbin.dev is not in the request path.

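One CLI, three possible destinations, each chosen to match a different organizational context: cloud (the hosted htmlbin product, public URLs), gh-pages (GitHub Pages in the org's own repo), and cloudflare (Cloudflare Pages behind Cloudflare Access).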

Same CLI verb. Same single-file contract. Different destinations for different contexts. "Destination" is the abstract term throughout — call it the hosting provider when context matters.
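
To make "same verb, different destinations" concrete, here is a rough sketch of how a destination abstraction could look. The method names are guesses for illustration only: the interface in tree has four methods that these notes don't list, and a real backend would be asynchronous and talk to the network. MemoryBackend and publishTo are hypothetical helpers.

```typescript
interface PublishResult {
  url: string;
}

// Hypothetical destination contract; not the real 4-method interface.
interface Backend {
  readonly name: string;
  // Upload one HTML document under a slug and return where it can be viewed.
  publish(slug: string, html: string): PublishResult;
  // Take the page down again, e.g. when the PR closes.
  remove(slug: string): void;
}

// In-memory stand-in so the sketch runs without network access.
class MemoryBackend implements Backend {
  readonly name: string;
  private baseUrl: string;
  private pages: { [slug: string]: string } = {};

  constructor(name: string, baseUrl: string) {
    this.name = name;
    this.baseUrl = baseUrl;
  }

  publish(slug: string, html: string): PublishResult {
    this.pages[slug] = html;
    return { url: this.baseUrl + "/" + slug + "/" };
  }

  remove(slug: string): void {
    delete this.pages[slug];
  }

  has(slug: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.pages, slug);
  }
}

// The verb stays the same; only the backend selected by --to differs.
function publishTo(backends: Backend[], to: string, slug: string, html: string): PublishResult {
  for (const backend of backends) {
    if (backend.name === to) {
      return backend.publish(slug, html);
    }
  }
  throw new Error("unknown destination: " + to);
}
```

Keeping viewer authentication out of this contract matches the point above: the backend only places the file, and the destination decides who may read it.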

how it would actually work — the gh-pages path

End-to-end on every PR push:

ENGINEER         GITHUB.COM            CI RUNNER             HOSTING              IdP
   │                  │                    │                    │                  │
   │ git push (PR)    │                    │                    │                  │
   │─────────────────>│                    │                    │                  │
   │                  │ trigger workflow   │                    │                  │
   │                  │───────────────────>│                    │                  │
   │                  │                    │                    │                  │
   │                  │                    │ npm ci && npm run build                 │
   │                  │                    │   → ./dist/index.html (?)             │
   │                  │                    │                    │                  │
   │                  │                    │ htmlbin publish ./dist/index.html --to gh-pages
   │                  │ Git Data API       │                    │                  │
   │                  │ atomic commit ←────│                    │                  │
   │                  │ to gh-pages/pr-N/  │                    │                  │
   │                  │                    │                    │                  │
   │                  │ Pages rebuild ──────────────────────────>│                  │
   │                  │                    │                    │                  │
   │                  │ sticky comment     │                    │                  │
   │                  │<───────────────────│                    │                  │
   │                  │                    │                    │                  │
   │ click URL        │                    │                    │                  │
   │──────────────────────────────────────────────────────────>│                  │
   │                                                            │ 302 → SSO ─────>│
   │ sign in ───────────────────────────────────────────────────────────────────>│
   │<─────────── session cookie ──────────────────────────────────────────────────│
   │ retry ────────────────────────────────────────────────────>│                  │
   │<───────────── 200 OK, HTML ───────────────────────────────│                  │

Three real commands run in the runner. No magic:

- run: |
    npm ci
    npm run build               # produces HTML — IF the repo's build does that
    npx @htmlbin/cli publish ./dist/index.html --to gh-pages
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
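
The diagram places each preview under gh-pages/pr-N/. A small sketch of that mapping follows; the function names are invented here, and the site base URL is passed in rather than derived, because the Pages hostname depends on how the site is configured.

```typescript
// Map a PR number to the file the publish step would write on the gh-pages branch.
function previewPath(prNumber: number): string {
  if (!isFinite(prNumber) || prNumber <= 0 || Math.floor(prNumber) !== prNumber) {
    throw new Error("PR number must be a positive integer, got " + prNumber);
  }
  return "pr-" + prNumber + "/index.html";
}

// Build the URL a reviewer would click, given the site's base URL.
function previewUrl(siteBase: string, prNumber: number): string {
  const base = siteBase.replace(/\/+$/, ""); // drop trailing slashes
  const dir = previewPath(prNumber).replace(/index\.html$/, "");
  return base + "/" + dir;
}
```

Because every push to the same PR writes the same path, the URL stays stable across commits, which is what lets the sticky comment keep pointing at one link.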

where it works cleanly — and where it doesn't

Three concrete user journeys map onto today's reality. One is the original problem; the other two are the rest of the world.

Works today

Static-site repos

Storybook · SPA framework builds · static-site generators. The build naturally emits HTML at ./dist/index.html or similar. The workflow publishes whatever the build produces.

~10 lines of YAML, no thinking required.

Works, narrow

Agent-driven local

Human at a terminal. Coding agent generates HTML to a file. htmlbin publish ships it to the cloud product. Existing flow; the CLI is just a friendlier curl.

Not the org-private journey — public URLs.

Gap — to solve

Non-static repos

Backend services · libraries · mobile apps · CLIs. No natural HTML artifact. The workflow as written has nothing to publish.

This is probably the majority of repos. The generation question (next section) is for this case.
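
Until generation exists, a workflow on a non-static repo can at least report that there was nothing to publish instead of failing silently. A sketch of the three-way branching (preview / no HTML / build failed), assuming the workflow knows whether the build succeeded and whether the expected file exists. Only the "Preview:" wording comes from these notes; the other two messages are placeholders.

```typescript
type Outcome = "preview" | "no-html" | "build-failed";

// Classify a CI run from two facts the workflow already has.
function classifyRun(buildSucceeded: boolean, htmlExists: boolean): Outcome {
  if (!buildSucceeded) return "build-failed";
  return htmlExists ? "preview" : "no-html";
}

// Pick the sticky-comment text for an outcome.
function commentFor(outcome: Outcome, previewUrl: string): string {
  switch (outcome) {
    case "preview":
      return "🔍 Preview: " + previewUrl;
    case "no-html":
      return "No HTML artifact was produced by this build, so there is nothing to preview.";
    case "build-failed":
      return "Build failed before a preview could be published.";
  }
}
```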

the generation problem

For non-static repos, "the preview" doesn't pre-exist anywhere. Something has to look at the PR's changes and produce HTML that represents them. The CLI's contract is publish-only, so generation needs to land upstream of the publish step — and the generator has to be more than a single API call. The thing that makes today's coding agents useful is the harness: file access, shell execution, multi-turn refinement, validation. A raw model call against a diff gives you noise; an agent loop gives you a real artifact.

We are not going to build a generation engine. The agent-harness pattern is well-established across the coding-tools landscape; multiple existing systems already do PR-aware code work with real tool use — file system access, bash execution, multi-turn refinement. The pragmatic move is to leverage one of them.

Decision: Option 1. The agent runs in the CI runner; htmlbin-cli stays publish-only and consumes whatever file the agent writes. The other two options stay as alternatives we might revisit, but neither ships now.

scope · medium · alternative
option 2

Trigger a remote agent in the vendor's cloud

Same agent harness pattern, but the harness runs on the vendor's infra rather than in your CI. A workflow webhook triggers the vendor on PR open; the vendor's sandbox checks out the repo, runs the agent loop, commits the HTML output back to a preview branch; htmlbin publish ships it.

Why not now: the repo content becomes visible to the vendor's sandbox (matters for sensitive code); each vendor has its own trigger model and billing surface; Option 1 already handles the "agent in cloud" case for orgs paying for one of these — the vendor's CLI typically works inside Option 1's CI step too.

Plays well with: GitHub Copilot Coding Agent · Cursor background agents · Devin · vendors with hosted agent products

scope · large · alternative
option 3

htmlbin grows a compose command that wraps Option 1

The CLI would add htmlbin compose --pr 1234 --output ./preview.html as sugar over Option 1: shell out to one specific agent harness with a canonical prompt.

Why not now: we'd own the prompt (a real product) and pick a default agent (a real bet) before we know what "good preview HTML for a PR" actually looks like. Right path only after a few teams have run Option 1 and converged on a prompt shape. Revisit once we have real data.

the risks we know about

Risk · The SSO gate has a paid floor

"Pages → Private" requires GitHub Enterprise Cloud (~$21/user/month). On Free, Pro, and Team plans a Pages site is public regardless of repo visibility, so the gh-pages backend does not gate anything for the orgs that need it most. Cloudflare Access fills that hole (free up to 50 users) but adds setup friction.

Risk · Generation quality is the actual product risk

Once we go past static-site repos, the preview is only as good as the agent that generates it. "Render an HTML representation of the diff" is a vague prompt that will produce inconsistent output. Whether reviewers actually find the preview useful — versus reading the diff directly — is an empirical question we haven't tested.

Risk · Hosting-provider rebuild lag

~60 seconds between commit and the URL being live on GitHub Pages (Cloudflare Pages is faster but still not instant). The sticky comment posts immediately, so reviewers might click during the gap and see a 404. Recoverable with a refresh but unpleasant.
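
One possible mitigation is to probe the URL on a backoff schedule and post the comment only once the page answers. The sketch below computes just the schedule; the 5-second start, doubling factor, and 90-second budget in the example are assumptions, not measured values.

```typescript
// Waits between liveness probes: start small, grow geometrically,
// and stop before the total would exceed the budget.
function probeDelays(initialMs: number, factor: number, budgetMs: number): number[] {
  if (initialMs <= 0 || factor < 1 || budgetMs < 0) {
    throw new Error("need initialMs > 0, factor >= 1, budgetMs >= 0");
  }
  const delays: number[] = [];
  let next = initialMs;
  let total = 0;
  while (total + next <= budgetMs) {
    delays.push(next);
    total += next;
    next = next * factor;
  }
  return delays;
}

// Example: probeDelays(5000, 2, 90000) yields [5000, 10000, 20000, 40000].
```

Behind an SSO gate an unauthenticated probe will likely see a redirect rather than a 200, so "no longer a 404" may be the practical signal that the page is live.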

Risk · Three different tokens, three different auth models

hb_* for cloud · GitHub token for gh-pages (a PAT, or the workflow's GITHUB_TOKEN in CI) · Cloudflare API token for the cloudflare backend. Error messages name the destination that failed, but the documentation overhead is real and the configuration story is muddier than "log in once."
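
A sketch of what per-destination credential lookup with destination-named errors might look like. GITHUB_TOKEN appears in the workflow above; HTMLBIN_TOKEN and CLOUDFLARE_API_TOKEN are placeholder variable names chosen for this example, not confirmed names from the CLI.

```typescript
type Destination = "cloud" | "gh-pages" | "cloudflare";

// Which environment variable each destination reads (two are placeholders).
const TOKEN_ENV: { [D in Destination]: string } = {
  "cloud": "HTMLBIN_TOKEN",
  "gh-pages": "GITHUB_TOKEN",
  "cloudflare": "CLOUDFLARE_API_TOKEN",
};

// Look up the credential for one destination, failing with an error
// that names the destination so the user knows which setup is incomplete.
function resolveToken(dest: Destination, env: { [name: string]: string | undefined }): string {
  const name = TOKEN_ENV[dest];
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error("[" + dest + "] missing credential: set " + name);
  }
  // Cloud tokens carry the hb_ prefix, so an obviously wrong value can be caught early.
  if (dest === "cloud" && value.indexOf("hb_") !== 0) {
    throw new Error("[cloud] " + name + " should start with hb_");
  }
  return value;
}
```

In the CLI the second argument would be the process environment; it is a plain object here so the lookup can be exercised without touching real secrets.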

Risk · gh-pages and cloudflare are unverified end-to-end

Unit tests pass for both. Cloud backend has been exercised against production. Neither alt destination has been exercised against a real sandbox yet. The Octokit git.getRef call's URL-encoding of heads/gh-pages looked suspicious in one test; the 404 from a non-existent repo was ambiguous. We don't know if there's a bug there.
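
The getRef doubt comes down to how the slash in heads/gh-pages ends up in the request path of GitHub's GET /repos/{owner}/{repo}/git/ref/{ref} endpoint. The sketch below only shows the two candidate encodings, so the end-to-end test has something concrete to compare against. It makes no claim about which one Octokit actually sends, and the owner and repo names are placeholders.

```typescript
// Encode the whole ref as one path component: the slash becomes %2F.
function encodeWhole(ref: string): string {
  return encodeURIComponent(ref);
}

// Encode each segment separately: the slash survives as a path separator.
function encodePerSegment(ref: string): string {
  return ref
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
}

// Build the request path for the "get a reference" endpoint.
function refPath(owner: string, repo: string, encodedRef: string): string {
  return "/repos/" + owner + "/" + repo + "/git/ref/" + encodedRef;
}

const wholePath = refPath("acme", "site", encodeWhole("heads/gh-pages"));
const segmentPath = refPath("acme", "site", encodePerSegment("heads/gh-pages"));
// wholePath   is "/repos/acme/site/git/ref/heads%2Fgh-pages"
// segmentPath is "/repos/acme/site/git/ref/heads/gh-pages"
```

Logging the outgoing request path during the real-repo test and matching it against these two strings would show whether the earlier 404 came from the encoding or simply from the missing repo.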

Risk · Cloudflare's setup curve is steeper than gh-pages

Sign up for Zero Trust, get an API token with the right scopes, find the account ID, run setup with IdP/email flags. More steps than "flip Pages → Private." The free-tier 50-user limit is real and not obvious.

what we're deliberately not solving

Out · Running the user's app server in our infra

For dynamic apps, the most accurate "preview" is the running app with the PR's code applied. That requires per-PR runtime — Lambdas, containers, edge functions, sandboxes. A real product, but a different one from htmlbin. Not in scope.

Out · A built-in model or generation prompt that we maintain

Even if we ship Option 3 (an in-CLI compose command), the prompt and provider stay user-facing and configurable. We're not in the business of operating a model or guaranteeing output quality.

Out · Multi-file drops in v1

One HTML file in, one URL out. Assets must be inline or CDN-hosted. Phase 2 if there's demand.
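
A pre-publish check could catch files that break the single-file contract. The sketch below is a regex heuristic, not an HTML parser, and the function names are invented for illustration: it flags any src or href value that is not an absolute URL, a data: or mailto: URI, or an in-page anchor.

```typescript
// True when a reference still resolves after the file is served on its own.
function isSelfContained(ref: string): boolean {
  return (
    /^(?:https?:)?\/\//i.test(ref) || // absolute or protocol-relative URL
    /^(?:data|mailto):/i.test(ref) || // inline data or mail link
    ref.charAt(0) === "#"             // in-page anchor
  );
}

// Collect src/href values that point at files the publish step will not upload.
function findRelativeAssets(html: string): string[] {
  const pattern = /\b(?:src|href)\s*=\s*["']([^"']+)["']/gi;
  const offenders: string[] = [];
  let match: RegExpExecArray | null = pattern.exec(html);
  while (match !== null) {
    const ref = match[1];
    if (!isSelfContained(ref)) {
      offenders.push(ref);
    }
    match = pattern.exec(html);
  }
  return offenders;
}
```

A non-empty result is a reasonable point to warn or fail before publishing, since those assets would 404 at the preview URL.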

Out · Destinations past v1

Other hosting providers (Vercel · Netlify · GitLab Pages · S3 + Cognito · plain filesystem) are all defined as Phase 2 entries on the same backend interface. None ship in v1. The interface is small enough that adding one is a single file.

Out · Versioning UX

The cloud destination gets versioning from the Worker. gh-pages overwrites (PR's git history is the log). Cloudflare keeps every deployment but only one alias per slug. No unified version-pinning command in the CLI; we'll add it if users actually ask.

what's actually in tree

v1 of the CLI lives on a working branch. Concrete state:

Surface | State
4-method Backend interface + three backends | typecheck clean
Unit tests (repo parsing, config resolution, error mapping, cloud destination via mocked HTTP) | 50/50 pass
Drop-in CI workflow with three branching sticky comments (preview / no HTML / build failed) | copy-ready
Cloud destination | exercised against production end-to-end
gh-pages destination | unverified end-to-end
cloudflare destination | unverified end-to-end
Generation: Option 1 reference workflow (Claude Code in CI → htmlbin publish) | shipping with this update · prompt untested on real PRs

CLI ergonomics — with attribution

The patterns that make a CLI feel native to coding agents aren't ours to invent. DataDog's pup CLI and its public design notes — along with the Speakeasy team's writeup on engineering an agent-friendly CLI — lay this out clearly enough that the right move is to copy what works. A few conventions we adopted directly:

Three pup patterns we deliberately didn't take, with reasons:

Credit where due: github.com/DataDog/pup and Speakeasy's engineering-agent-friendly-cli post. Both are good reads for anyone shipping a CLI that agents will use.

open questions

  1. What does "good preview HTML for a PR" actually look like? Until we try it on a real PR with the Option 1 workflow, we're guessing. The prompt in the reference workflow is a starting point; teams will tune it for their change types. The empirical answer drives whether Option 3 (htmlbin owning a prompt) ever makes sense.
  2. Which agent CLI do we lead the reference workflow with? Claude Code (claude -p) is the chosen example because its syntax is the most concise. Codex CLI (codex exec) is documented as a swap point. We're not endorsing one over the other.
  3. Is the Octokit git.getRef URL-encoding a real bug? A 30-minute test against a real private repo answers it.
  4. Does the three-token UX hold up? Or do we eventually need a per-destination htmlbin login with a unified config?
  5. Is "three destinations" the right framing, or does it dilute the product? Cloud + one alt might be cleaner than cloud + two alts that overlap.

what to learn next

Two experiments answer most of what's still uncertain:

  1. End-to-end gh-pages flow against a real private repo + GH Enterprise Cloud. Confirm SSO redirect works, the Octokit URL encoding isn't actually broken, and the rebuild lag is tolerable.
  2. Run the Option 1 reference workflow on a real non-static repo PR. Use the shipped agent-preview-workflow.yml, swap in an Anthropic API key (or Codex equivalent), and see what HTML the agent actually produces. Iterate on the prompt; record what kinds of PRs the pattern handles well vs. where it produces noise. That data tells us whether Option 3 (sugar layer) is worth building later, and whether to ship a curated set of prompts for common change types.

The code in tree can stay either way. Those two experiments tell us whether to keep building, change scope, or pick a different direction entirely.