Reviewer Guide

How to review a website environment

Reviewing is as important as building. A mirror that passes automated checks can still be trivially easy, visually broken, or functionally shallow. Your job is to catch what machines miss. Review 5 environments to join the author list.

Tooling

We provide a review-env skill that guides you through each checklist systematically. Clone the repo and use a coding agent (e.g., Claude Code or Codex) to assist your review:

shell
git clone https://github.com/aiming-lab/WebHarbor.git && cd WebHarbor

The review-env skill includes visual fidelity checks, functional depth tests, task quality audits with leak detection, and a structured PR review template. The harden-env skill covers the 13 known leak archetypes we've catalogued from building the 15 initially released mirrors.

Review pipeline overview

  1. Check out & build: gh pr checkout, ./scripts/fetch_assets.sh, ./scripts/build.sh, then run on alt ports (:8201, :41000-41014) so you don't collide with any running container.
  2. Mechanical checks: all 15 sites return 200, the byte-identical reset passes (md5sum instance/<site>.db == instance_seed/<site>.db), and the HF revision SHA is verified.
  3. Visual + functional: side-by-side with the real site. Auth, search, CRUD, detail pages, and form validation all work.
  4. Task quality audit: walk through each task in tasks.jsonl for solvability, answer leaks, distractor density, and difficulty.

Mechanical checks

If any of these fail, request changes immediately — don't bother with the deeper review yet.

shell
gh pr checkout <pr-number>
./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review \
  -p 8201:8101 -p 41000-41014:40000-40014 webharbor:dev

# all 15 sites return 200
for p in $(seq 41000 41014); do
  curl -so /dev/null -w "$p:%{http_code}\n" http://localhost:$p/
done

# byte-identical reset
curl -X POST http://localhost:8201/reset/<site>
docker exec wh-review md5sum \
  /opt/WebSyn/<site>/instance/<site>.db \
  /opt/WebSyn/<site>/instance_seed/<site>.db
# the two md5s MUST match

# parallel reset still works for everyone
time curl -X POST http://localhost:8201/reset-all

# verify HF revision pin
cat .assets-revision
# check the SHA exists at huggingface.co/datasets/ChilleD/WebHarbor

Checklist 1: Visual fidelity

Open the mirror and the real website side by side. Check:

Layout accuracy

Does the page structure (header, nav, sidebar, footer, grid) match the real site? Are major sections in the right order?

Real images

Are product photos, article images, and logos real (not placeholders or colored rectangles)? Check at least 3 detail pages.

Typography & colors

Do fonts, colors, and spacing roughly match? The mirror doesn't need to be pixel-perfect, but it should feel like the same brand.

Working navigation

Click every top-level nav link, category link, and footer link. Do they lead to real pages (not 404s)?
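
You can pre-screen the 404 sweep from the shell before eyeballing pages. A minimal sketch, assuming server-rendered HTML with site-relative hrefs (take :41000 as whichever site you're reviewing); it only catches dead links, so the visual judgment is still on you:

shell
# Pull every href off the homepage and check its status code.
# Assumes relative links in static HTML; adjust the port per site.
BASE=http://localhost:41000
curl -s "$BASE/" \
  | grep -oE 'href="[^"]+"' | sed -E 's/^href="//; s/"$//' \
  | grep -vE '^(https?:|#|mailto:)' | sort -u \
  | while read -r path; do
      code=$(curl -so /dev/null -w '%{http_code}' "$BASE/${path#/}")
      [ "$code" = "200" ] || echo "BROKEN ($code): $path"
    done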

Checklist 2: Functional depth

Test the site's interactive features by actually using them:

Auth flows

Register a new account, log out, log back in. Reset password if available. Test with seeded accounts (alice.j@test.com / TestPass123!).
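
The login flow is easy to smoke-test with a curl cookie jar as well. A sketch under the assumption of a conventional POST /login with email/password form fields and an /account page; read the actual form action and input names from the HTML first:

shell
# Log in as the seeded user and confirm the session actually persists.
# /login, /account, and the field names are assumptions; inspect the form first.
SITE=http://localhost:41000
curl -sc /tmp/wh.cookies -o /dev/null -w 'login: %{http_code}\n' \
  --data-urlencode 'email=alice.j@test.com' \
  --data-urlencode 'password=TestPass123!' \
  "$SITE/login"
# A session-gated page should now return 200 instead of bouncing to /login.
curl -sb /tmp/wh.cookies -o /dev/null \
  -w 'account: %{http_code} redirect=%{redirect_url}\n' "$SITE/account"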

Search & browse

Search for 3+ different queries. Do results make sense? Are there enough results (≥6)? Does multi-word search work (not just exact match)?
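
Strict AND matching (the "Broken search" failure in the table further down) shows up fastest when you compare result counts across one-word and multi-word queries. A sketch, assuming a GET /search?q= endpoint and a result-card CSS class; both are placeholders for whatever the site actually uses:

shell
# Compare result counts as queries get longer; a cliff to zero suggests strict AND.
# '/search?q=' and 'result-card' are placeholders; check the route and template.
SITE=http://localhost:41000
for q in 'camera' 'zoom camera' 'compact zoom camera'; do
  n=$(curl -s --get --data-urlencode "q=$q" "$SITE/search" \
      | grep -o 'class="result-card"' | wc -l)
  printf '%-24s %s results\n' "\"$q\"" "$n"
done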

CRUD operations

Add items to cart/bookmarks/lists. Remove them. Edit account settings. Submit forms. Check that state changes persist across page navigation.

Detail pages

Open 5+ detail pages (products, articles, courses). Are specs, descriptions, and metadata populated? Are images loading?
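
Placeholder and broken images usually give themselves away over HTTP. A sketch that lists each img src on one detail page and reports its Content-Type; PAGE is a hypothetical path, so point it at whatever detail page you have open:

shell
# Every <img> on a detail page should resolve with an image/* Content-Type.
# PAGE is hypothetical; substitute a real product/article URL.
BASE=http://localhost:41000
PAGE=/products/1
curl -s "$BASE$PAGE" \
  | grep -oE '<img[^>]*src="[^"]+"' | grep -oE 'src="[^"]+"' \
  | sed -E 's/^src="//; s/"$//' \
  | grep -vE '^https?://' \
  | while read -r src; do
      ct=$(curl -sI "$BASE/${src#/}" | grep -i '^content-type:' | tr -d '\r')
      echo "$src -> ${ct:-no response}"
    done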

Checklist 3: Task quality & difficulty

This is the most important check. For each of the 15–20 proposed tasks:

A. Answer leak check

Can you find the answer without navigating to a detail page? If the answer appears in a search result title, card subtitle, or page heading, it's a leak. The answer must require reading a detail page, comparing specs, or multi-step reasoning.
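
Part of this check can be scripted: grep each task's answer string against the card-level listing pages an agent sees before opening any detail page. A rough sketch, assuming tasks.jsonl carries an answer field and the catalog lives at /products (both assumptions; adapt to the real schema), with jq installed:

shell
# Flag answers already visible at listing level, i.e., without a click-through.
# The 'answer' field and the '/products' listing path are assumptions.
SITE=http://localhost:41000
listing=$(curl -s "$SITE/products")
jq -r '.answer' tasks.jsonl | while read -r ans; do
  printf '%s' "$listing" | grep -qiF "$ans" \
    && echo "POSSIBLE LEAK: \"$ans\" appears on the listing page"
done

A hit is a flag, not a verdict: confirm by eye whether the string sits in a card title or subtitle (a leak) or only in markup an agent wouldn't read.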

B. Distractor adequacy

When you search for the task's target, do you see items that look similar but don't satisfy all constraints? There should be near-miss distractors that require careful reading to distinguish from the correct answer. If every result satisfies the task, it's too easy.

C. Meaningful difficulty

Would a frontier model (e.g., GPT-5) solve this in one click? Good tasks require multiple steps: search, filter, open detail pages, compare, reason about constraints. Include at least 3–5 tasks that require ≥5 steps.

Key question

Would you be satisfied if this task appeared in a published benchmark? If the task is trivially solvable, has leaked answers, or tests only surface-level browsing, request changes. A mirror with 10 well-designed tasks is worth more than one with 100 trivial ones.

Common issues we've seen

Based on reviewing 15+ mirrors, these are the patterns that coding agents most frequently get wrong:

| Issue | What it looks like | How to fix |
|---|---|---|
| Answer in title | Task: "find a camera with 10x zoom" → product named "Canon ELPH 360 (12x Zoom)" | Strip constraint values from card-level fields. Push to specs/description only. |
| Single-item catalog | Task: "buy iPhone 17" → database contains only 1 phone | Seed 20+ phones across brands, models, price ranges. Agent must search and compare. |
| Broken search | "Boston Celtic players" returns 0 results (strict AND match fails) | Use token-overlap scored search. Count matching tokens, filter score > 0. |
| Placeholder images | Products show colored rectangles or "Image coming soon" boxes | Scrape real product images from the live site. Every visible item needs a real photo. |
| Fake forms | Checkout form has fields but submit does nothing, or always succeeds without validation | Forms must validate, persist to DB, and redirect correctly. Test the full flow end-to-end. |
| Count leak | Task: "how many courses in X?" → heading says "4 courses" | Remove count labels. Show the items; let the agent count them. |
| Byte-identity fails | After /reset/<site>, md5 of instance/<site>.db doesn't match instance_seed/<site>.db | A seed_*() function isn't idempotent. Gate it at the function level (early-return on populated DB), not per-row. |
| JSON-backed handlers | Handler reads from scraped_data/*.json at request time | All runtime data must come from instance_seed/<site>.db. Fold scrape JSON into the seed DB via seed_data.py. |
| Cross-site imports | sites/foo/app.py imports from sites/bar/ | Sites must be isolated. Each runs as its own Flask process. Grep for "from sites." in the diff. |
| Stale HF pin | .assets-revision points to a nonexistent or unmerged HF commit | Verify the SHA exists at huggingface.co/datasets/ChilleD/WebHarbor and is merged into main. |
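
Several rows in this table have cheap static checks you can run on the checked-out branch before the manual pass. A sketch; the grep patterns are heuristics, not proofs:

shell
# Static sweeps for issues that are visible in the diff itself.
grep -rn 'from sites\.' sites/ && echo 'FAIL: cross-site import'
grep -l 'scraped_data' sites/*/app.py && echo 'WARN: possible JSON read at request time'
grep -rli 'image coming soon' sites/ && echo 'WARN: placeholder image markup'
# git ls-remote only lists ref tips, so treat a miss as "verify the pin by hand".
git ls-remote https://huggingface.co/datasets/ChilleD/WebHarbor \
  | grep -f .assets-revision || echo 'CHECK MANUALLY: pinned SHA is not a ref tip'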

How to submit your review

  1. Go to the open Pull Requests on GitHub.
  2. Pick a PR that adds a new environment.
  3. Deploy locally and run through all 3 checklists above.
  4. Leave a structured comment on the PR:
    • Visual fidelity: PASS / FAIL (with screenshots)
    • Functional depth: PASS / FAIL (list issues found)
    • Task quality: PASS / FAIL per task (note any leaks or triviality)
  5. If issues are found, work with the contributor to iterate until all checklists pass.
  6. Approve the PR when satisfied.

Author list

Reviewing 5 environments (with thorough checklist reports) earns a spot on the final paper's author list. We track reviews via GitHub PR activity.
