Reviewing is as important as building. A mirror that passes automated
checks can still be trivially easy, visually broken, or functionally
shallow. Your job is to catch what machines miss. Review 5 environments
to join the author list.
Tooling
We provide a review-env skill that guides you through
each checklist systematically. Clone the repo and use
a coding agent (e.g., Claude Code or Codex) to assist your review:
shell
git clone https://github.com/aiming-lab/WebHarbor.git && cd WebHarbor
The review-env skill includes visual fidelity checks, functional depth tests,
task quality audits with leak detection, and a structured PR review template.
The harden-env skill covers the 13 known leak archetypes we catalogued
while building the initial 15 mirrors.
Review pipeline overview
1. Check out & build: gh pr checkout, ./scripts/fetch_assets.sh, ./scripts/build.sh, then run on alt ports (:8201, :41000-41014) so you don't collide with any running container.
2. Visual fidelity: compare the mirror side by side with the real site.
3. Functional depth: auth, search, CRUD, detail pages, and form validation all work.
4. Task quality audit: walk through each task in tasks.jsonl for solvability, answer-leak detection, distractor density, and difficulty.
Mechanical checks
If any of these fail, request changes immediately — don't bother
with the deeper review yet.
shell
gh pr checkout <pr-number>
./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review \
  -p 8201:8101 -p 41000-41014:40000-40014 webharbor:dev
# all 15 sites return 200
for p in $(seq 41000 41014); do
  curl -so /dev/null -w "$p:%{http_code}\n" http://localhost:$p/
done
# byte-identical reset
curl -X POST http://localhost:8201/reset/<site>
docker exec wh-review md5sum \
  /opt/WebSyn/<site>/instance/<site>.db \
  /opt/WebSyn/<site>/instance_seed/<site>.db
# the two md5s MUST match
# parallel reset still works for everyone
time curl -X POST http://localhost:8201/reset-all
# verify HF revision pin
cat .assets-revision
# check the SHA exists at huggingface.co/datasets/ChilleD/WebHarbor
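If you want to run the reset check across every site instead of one <site> at a time, a loop like the sketch below works. It assumes /opt/WebSyn holds exactly one directory per site and that the reset endpoint uses those directory names; confirm both against the repo before trusting the output.
shell
# hedged sketch: per-site byte-identical reset check (assumes /opt/WebSyn lists one dir per site)
for site in $(docker exec wh-review ls /opt/WebSyn); do
  curl -s -X POST "http://localhost:8201/reset/$site" > /dev/null
  a=$(docker exec wh-review md5sum "/opt/WebSyn/$site/instance/$site.db" | cut -d' ' -f1)
  b=$(docker exec wh-review md5sum "/opt/WebSyn/$site/instance_seed/$site.db" | cut -d' ' -f1)
  [ "$a" = "$b" ] && echo "$site: reset OK" || echo "$site: MISMATCH"
done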
Checklist 1: Visual fidelity
Open the mirror and the real website side by side. Check:
Layout accuracy
Does the page structure (header, nav, sidebar, footer, grid) match the real site? Are major sections in the right order?
Real images
Are product photos, article images, and logos real (not placeholders or colored rectangles)? Check at least 3 detail pages.
Typography & colors
Do fonts, colors, and spacing roughly match? The mirror doesn't need pixel-perfection, but should feel like the same brand.
Working navigation
Click every top-level nav link, category link, and footer link. Do they lead to real pages (not 404s)?
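Before clicking through every link by hand, you can pre-screen the navigation check with a status-code sweep: pull the root-relative hrefs from a site's homepage and curl each one. This is only a sketch; the port belongs to whichever site you are reviewing, it only follows href="/..." links, and it won't catch pages that load but render broken.
shell
base=http://localhost:41000   # substitute the host port of the site under review
curl -s "$base/" \
  | grep -o 'href="/[^"]*"' | cut -d'"' -f2 | sort -u \
  | while read -r path; do
      printf '%s %s\n' "$(curl -so /dev/null -w '%{http_code}' "$base$path")" "$path"
    done | grep -v '^200 ' || echo "all linked pages returned 200"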
Checklist 2: Functional depth
Test the site's interactive features by actually using them:
Auth flows
Register a new account, log out, log back in. Reset password if available. Test with seeded accounts (alice.j@test.com / TestPass123!).
Search & browse
Search for 3+ different queries. Do results make sense? Are there enough results (≥6)? Does multi-word search work (not just exact match)?
CRUD operations
Add items to cart/bookmarks/lists. Remove them. Edit account settings. Submit forms. Check that state changes persist across page navigation.
Detail pages
Open 5+ detail pages (products, articles, courses). Are specs, descriptions, and metadata populated? Are images loading?
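For the search check, a quick curl loop can confirm that multi-word queries return a plausible number of results before you judge relevance in the browser. The /search?q= route and the class="card" result markup below are assumptions, not the actual WebHarbor templates; read the site under review to find the real query parameter and selector.
shell
base=http://localhost:41000                       # host port of the site under review
for q in "wireless mouse" "boston celtics" "running shoes women"; do
  url="$base/search?q=$(echo "$q" | tr ' ' '+')"  # assumed search route
  hits=$(curl -s "$url" | grep -c 'class="card"') # assumed result-card markup
  echo "$q -> $hits results"
done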
Checklist 3: Task quality & difficulty
This is the most important check. For each of the 15–20 proposed tasks:
A. Answer leak check
Can you find the answer without navigating to a detail page?
If the answer appears in a search result title, card subtitle, or
page heading, it's a leak. Finding the answer must require
reading a detail page, comparing specs, or multi-step reasoning
(a crude command-line probe for this follows the list).
B. Distractor adequacy
When you search for the task's target, do you see items that look
similar but don't satisfy all constraints? There should be
near-miss distractors that require careful reading to distinguish
from the correct answer. If every result satisfies the task, it's
too easy.
C. Meaningful difficulty
Would a frontier model (e.g., GPT-5) solve this in one click?
Good tasks require multiple steps: search, filter, open detail pages,
compare, reason about constraints. Include at least 3–5 tasks
that require ≥5 steps.
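Checks A and B can be partly mechanized: search for the task's target with curl, grep the listing page for the answer-bearing value, and count how many result cards come back. The route, the query, the "12x Zoom" value, and the card class below are all illustrative placeholders, not taken from any real WebHarbor task.
shell
base=http://localhost:41000                      # host port of the site under review
listing=$(curl -s "$base/search?q=camera")       # assumed search route and query
echo "$listing" | grep -qi "12x zoom" \
  && echo "possible leak: constraint value visible at listing level" \
  || echo "constraint value not in listing; detail-page navigation required"
echo "$listing" | grep -c 'class="card"'         # rough count of candidate distractors (assumed markup)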
Key question
Would you be satisfied if this task appeared in a published
benchmark? If the task is trivially solvable, has leaked
answers, or tests only surface-level browsing, request changes.
A mirror with 10 well-designed tasks is worth more than one with 100
trivial ones.
Common issues we've seen
Based on reviewing 15+ mirrors, these are the patterns that coding agents
most frequently get wrong:
Answer in title
What it looks like: task "find a camera with 10x zoom" → product named "Canon ELPH 360 (12x Zoom)".
How to fix: strip constraint values from card-level fields; push them to specs/description only.
Single-item catalog
What it looks like: task "buy iPhone 17" → database contains only 1 phone.
How to fix: seed 20+ phones across brands, models, and price ranges so the agent must search and compare.
Broken search
What it looks like: "Boston Celtic players" returns 0 results (strict AND match fails).