This guide walks you through building a high-quality, multimodal
mirror site for WebHarbor. The process combines coding agents for
rapid construction with human review for quality assurance.
Expect 1 day of work per site.
Tooling
We provide coding agent skills that automate each phase of this pipeline.
Clone the repo and use a coding agent (e.g., Claude Code or Codex) with the built-in skills:
shell
git clone https://github.com/aiming-lab/WebHarbor.git && cd WebHarbor
⚡ Fastest path: one-shot prompt to Coding Agent
Once you've cloned the repo (so the .claude/skills/
directory is available), paste the prompt below into
your favorite coding agent with your target website
filled in. The agent will run Phases 1–5 end-to-end and stop
before submitting the PR.
Tip: for best results, we recommend using GPT-5.5 or Claude Opus 4.6 level models with high reasoning effort.
prompt for coding agent
I want to contribute a new website mirror to WebHarbor.
Target site: <REAL_URL> # e.g. https://www.target-site.com/
Site slug: <SLUG> # e.g. target_site (lowercase, snake_case)
Follow the WebHarbor contribution pipeline end-to-end using the local skills under .claude/skills/. Specifically:
Phase 1 — Use the `clone-website` skill:
- Run ./scripts/new_site.py <SLUG> to scaffold sites/<SLUG>/
- Register the site in websyn_start.sh, control_server.py, Dockerfile
- Scrape structure, harvest real assets (no placeholders), build the Flask + SQLAlchemy app
- Replicate the frontend with Jinja2 templates matching the original site
- Seed an initial idempotent DB (seed_database + seed_benchmark_users with alice.j@test.com et al.)
Phase 2 — Use the `design-tasks` skill:
- Write 15-20 benchmark tasks to sites/<SLUG>/tasks.jsonl
- Cover the site's full functional breadth (search, browse, cart, checkout, account, etc.)
- Include 3-5 hard tasks that require multi-step reasoning
- Use the WebVoyager schema: {web_name, id, ques, web, upstream_url}
Phase 3 — Use the `evolve-env` skill:
- Manually walk through each task; extend the mirror to support it
- Detect and fix task info leaks, superficial completion, insufficient distractors
Phase 4 — Use the `harden-env` skill:
- Audit every task against the 4 hardening dimensions (de-leak / distractors / catalog breadth / cross-field consistency)
- Check the 13 known leak archetypes
- Re-verify byte-identical reset
Phase 5 — Use the `seed-database` skill:
- Confirm all seed_*() functions are idempotent at the function level
- Stabilize instance_seed/<SLUG>.db (boot-and-freeze cycle until md5 matches)
- Implement scored token-overlap search if not already
Verification (after each phase and at the end):
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-test -p 8201:8101 -p 41000-41014:40000-40014 webharbor:dev
curl -X POST http://localhost:8201/reset/<SLUG>
docker exec wh-test md5sum /opt/WebSyn/<SLUG>/instance/<SLUG>.db /opt/WebSyn/<SLUG>/instance_seed/<SLUG>.db
# the two md5s MUST match
Stop before opening the PR. Print a summary of:
- Files added / modified
- Number of seeded rows per major model
- Tasks count in tasks.jsonl
- Byte-identical reset confirmation
- Anything that needs human review or fixing
- Detailed steps how to finally submit the PR (HuggingFace assets PR + GitHub PR + .assets-revision bump)
DO NOT STOP UNLESS YOU FINISH ALL THE STEPS. THE WHOLE TASK CAN BE HOURS OF WORK, SO BE PATIENT AND PERSISTENT. IF YOU ENCOUNTER AN ERROR, FIX IT AND KEEP GOING.
I will review your output and then drive the PR submission myself (HuggingFace assets PR + GitHub PR + .assets-revision bump).
Step-by-step manual reference below ↓
Want to understand each phase yourself, or run them one at a time?
The full step-by-step walkthrough follows. Useful both for first-time
contributors and for debugging when the one-shot prompt above gets
stuck.
Phase 0: Claim your website
Before you start
Browse the website tracking sheet
and find an unclaimed site. Submit the
contribution form
so we can lock it for you. You can expect to hear from us within 48 hours.
Phase 1: Fork, scaffold, and clone
WebHarbor lives across two repositories: code on GitHub and heavy
assets (seed DBs, images) on the HuggingFace dataset ChilleD/WebHarbor. Fork both, then scaffold a new
site under sites/<your_site>/:
shell
git clone https://github.com/<you>/WebHarbor && cd WebHarbor
./scripts/fetch_assets.sh # pull current HF assets (~2.8 GB)
./scripts/new_site.py mysite # scaffold sites/mysite/
The scaffold creates the standard skeleton:
layout
sites/mysite/
├── app.py ← routes + SQLAlchemy models
├── seed_data.py ← build-time seed (must be idempotent)
├── _health.py ← end-to-end health check
├── templates/ ← Jinja2 templates
├── static/{css,js,icons}/ ← small UI, in git
├── static/images/ ← heavy, in HF dataset
├── instance_seed/<site>.db ← seed DB, in HF dataset
└── tasks.jsonl ← benchmark tasks
Register the new site in three places (must stay in sync):
websyn_start.sh (the SITES=( ... ) array),
control_server.py (the SITES = [ ... ] list), and
Dockerfile (EXPOSE 8101 40000-N).
Scrape structure
Map the live site's page hierarchy, navigation, and URL patterns. Output goes into scraped_data/ (gitignored).
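A minimal sketch of this step, assuming requests and BeautifulSoup; the selector and output path are illustrative, not pipeline requirements:
python
import json, urllib.parse
import requests
from bs4 import BeautifulSoup

def scrape_structure(base_url, out="scraped_data/structure.json"):
    """Collect same-site links as a first map of the page hierarchy."""
    html = requests.get(base_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    pages = sorted({
        urllib.parse.urljoin(base_url, a["href"]).split("#")[0]
        for a in soup.select("a[href]")
        if urllib.parse.urljoin(base_url, a["href"]).startswith(base_url)
    })
    with open(out, "w") as f:
        json.dump({"base": base_url, "pages": pages}, f, indent=2)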
Harvest real assets
Download product images, article photos, logos. Place under static/images/ (HF-managed). No placeholders.
Replicate the frontend
Build Jinja2 templates that match the original site's layout, typography, and responsive behavior.
Phase 2: Design tasks (15–20 per site)
Tasks define what the environment must support. Write them to
sites/<site>/tasks.jsonl using the WebVoyager schema
(one JSON object per line):
tasks.jsonl
{"web_name": "Mysite", "id": "Mysite--0", "ques": "Search for ...", "web": "http://localhost:40015/", "upstream_url": "https://www.mysite.com/"}
{"web_name": "Mysite", "id": "Mysite--1", "ques": "Filter products under $30 with 4+ stars ...", "web": "http://localhost:40015/", "upstream_url": "https://www.mysite.com/"}
You can adapt tasks from existing benchmarks (WebVoyager, Online-Mind2Web)
or synthesize new ones with an LLM — LLMs have strong knowledge of
popular websites and make effective task generators.
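Whatever the source, it's worth validating the file before moving on. A quick sketch that checks each line parses and carries the required schema keys (the path is illustrative):
python
import json

REQUIRED = {"web_name", "id", "ques", "web", "upstream_url"}

with open("sites/mysite/tasks.jsonl") as f:
    for n, line in enumerate(f, 1):
        task = json.loads(line)              # must be one JSON object per line
        missing = REQUIRED - task.keys()
        assert not missing, f"line {n}: missing keys {missing}"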
Coverage principle
Tasks must cover the site's full functional breadth, not just one feature.
For example, an Amazon mirror should include tasks across:
searching products, filtering by category, reading reviews,
adding to cart, checkout flow, order history, account settings,
address management, and payment methods.
Difficulty principle
Include tasks that current frontier models cannot easily solve:
multi-step workflows, disambiguation scenarios (user has 3 payment
cards, which one?), cross-page reasoning, and tasks requiring visual
understanding of product images or map layouts.
Phase 3: Task-driven environment evolution
Feed the tasks to a coding agent and let it evolve the environment to
support each task. The agent will add routes, templates, database seeds,
and form handlers on demand.
Task info leak
Coding agents frequently make tasks trivially easy. A "find product X"
task becomes solvable without searching because the product is the only
item displayed, or appears in the page title. The answer should
never be visible without navigating, searching, and reading.
We believe this is related to reward hacking in agent training.
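One cheap guard against the most blatant leaks: assert that the answer string is not readable on the list page that should hide it. A sketch, assuming the mirror is running locally; the URL and answer are illustrative:
python
import requests

def assert_no_leak(list_url, answer):
    """The answer must only be reachable via click-through, never on the list page."""
    html = requests.get(list_url, timeout=10).text.lower()
    assert answer.lower() not in html, f"leak: {answer!r} visible on {list_url}"

assert_no_leak("http://localhost:40015/search?q=phone", "Zephyr X2 Pro")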
Superficial completion
Agents often produce pages that pass automated checks but fail under
real interaction: placeholder text, broken forms, missing images,
search that only returns exact matches, or checkout flows that
skip validation steps.
Insufficient distractors
If the task asks "buy an iPhone 17", the database must not contain only
one phone. Seed diverse distractor items: multiple phone
models, brands, and price ranges. The agent must compare and select,
not just click the only option.
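A minimal sketch of distractor seeding, assuming a Product model with brand and category columns (the rows are illustrative):
python
def seed_phone_distractors():
    """Seed near-miss phones so the agent must compare, not click the only option."""
    if Product.query.filter_by(category="phones").count() > 0:
        return  # keep the whole function idempotent
    rows = [
        ("iPhone 17", "Apple", 999),       # the target
        ("iPhone 17 Pro", "Apple", 1199),  # same brand, wrong model
        ("Galaxy S25", "Samsung", 899),    # different brand
        ("Pixel 10", "Google", 799),       # lower price range
    ]
    for name, brand, price in rows:
        db.session.add(Product(name=name, brand=brand,
                               price=price, category="phones"))
    db.session.commit()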
Phase 4: Hardening (critical)
This is where human review is indispensable. Systematically check each
task against the 4 hardening dimensions:
A. De-leak answers
Task constraint values must NOT appear in card titles, search result snippets, or page headings. Push answer details to spec tables, description prose, or detail-page sections that require click-through.
B. Add near-miss distractors
For each task, ensure the search results contain items that match the query category but fail ONE constraint. Target ≤50% full-match density so agents must read specs.
C. Broaden the catalog
Search queries should return ≥6 results from multiple sub-categories. If a zip code search returns only pizza shops, add coffee shops, gyms, banks, and pharmacies in the same area.
D. Cross-field consistency
When modifying any product/item field, regenerate ALL related fields (specs, description, features, tags) from the same source of truth to prevent contradictions.
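A sketch of the single-source-of-truth idea: derive every rendered field from one canonical record, so an edit cannot leave the description contradicting the spec table (field names are illustrative):
python
def derive_fields(item):
    """Regenerate all derived fields from one canonical dict."""
    return {
        "specs": {"brand": item["brand"], "price_usd": item["price"],
                  "battery_mah": item["battery"]},
        "description": (f"{item['name']} by {item['brand']} with a "
                        f"{item['battery']} mAh battery, priced at ${item['price']}."),
        "tags": [item["brand"].lower(), item["category"]],
    }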
Why humans are essential
Automated tests verify that the environment supports a task,
but only a human can verify it's challenging. Can the answer
be found without scrolling? Does the search show only the target item?
Is the layout visually faithful to the real site? These are judgment
calls that require human eyes.
Phase 5: Stabilize the seed DB & ship to HuggingFace
Idempotent seeding (the byte-identical reset invariant)
Every seed_*() function in app.py /
seed_data.py MUST early-return when the DB is already
populated. Per-row gates are NOT enough — even a no-op
db.session.commit() bumps SQLite metadata and breaks
/reset/<site> byte-identity. Gate every seed
function as a whole:
python
def seed_database():
    if Product.query.count() > 0:
        return  # gate the whole function
    # ... seed rows ...

def seed_benchmark_users():
    if User.query.filter_by(email='alice.j@test.com').first():
        return
    # ... seed 4 users ...
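Phase 5 of the one-shot prompt also asks for scored token-overlap search. A minimal sketch of one way to do it, assuming a Product model with name and description columns:
python
def search_products(query, limit=20):
    """Rank products by how many query tokens appear in their text fields."""
    tokens = set(query.lower().split())
    scored = []
    for p in Product.query.all():
        haystack = f"{p.name} {p.description}".lower()
        score = sum(t in haystack for t in tokens)
        if score:
            scored.append((score, p))
    scored.sort(key=lambda s_p: s_p[0], reverse=True)
    return [p for _, p in scored[:limit]]
Scoring partial matches keeps near-miss items in the results, which is exactly what the Phase 4 hardening dimensions want.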
Realistic volume
Seed 50–200 items per major entity. Real sites have thousands; a small but diverse catalog preserves the browsing experience.
Seed 4 benchmark users (alice.j@test.com etc., password TestPass123!) with pre-existing carts, bookmarks, orders, and profiles for auth-gated tasks.
Runtime data in DB, not JSON
HTTP handlers read from SQLAlchemy, not JSON files. Fold scrape JSON into instance_seed/<site>.db at build time via seed_data.py.
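A minimal sketch of the build-time fold, assuming a scraped_data/products.json dump and the Product model (the file name and fields are illustrative):
python
import json
from pathlib import Path

def seed_from_scrape():
    """Fold scraped JSON into the seed DB; handlers never read JSON at runtime."""
    if Product.query.count() > 0:
        return  # whole-function gate, as above
    for r in json.loads(Path("scraped_data/products.json").read_text()):
        db.session.add(Product(name=r["name"], price=r["price"],
                               description=r["description"]))
    db.session.commit()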
Two-repo workflow
Heavy assets (instance_seed/*.db,
static/images/) live on the HuggingFace dataset
ChilleD/WebHarbor, not directly in git.
.assets-revision pins the exact HF commit. After your
DB / images are stable, run:
shell
./scripts/extract_assets.sh ../wh-static-pr/
cd ../wh-static-pr
hf upload-large-folder <your-fork>/WebHarbor . --repo-type dataset
# open PR on https://huggingface.co/datasets/ChilleD/WebHarbor
# after merge, bump .assets-revision in the code repo
Phase 6: Pre-PR checks & submission
Run all of these before opening the GitHub PR:
shell
# 1. syntax
python3 -m py_compile sites/<site>/app.py
# 2. build
./scripts/build.sh webharbor:dev
# 3. run on alt ports
docker run -d --rm --name wh-test \
-p 8201:8101 -p 41000-41014:40000-40014 webharbor:dev
# 4. all 15 sites return 200
for p in $(seq 41000 41014); do
curl -so /dev/null -w "$p:%{http_code}\n" http://localhost:$p/
done
# 5. byte-identical reset invariant
curl -X POST http://localhost:8201/reset/<site>
docker exec wh-test md5sum \
/opt/WebSyn/<site>/instance/<site>.db \
/opt/WebSyn/<site>/instance_seed/<site>.db
# the two md5s MUST match
docker stop wh-test
Your GitHub PR description should include:
The real site mirrored + URL
Number of seeded rows per major model
Link to the paired HuggingFace PR (asset side)
Output of POST /reset/<site> showing ready: true
Screenshot evidence of visual fidelity vs. the real site