Contributor Guide

How to synthesize a new website environment

This guide walks you through building a high-quality, multimodal mirror site for WebHarbor. The process combines coding agents for rapid construction with human review for quality assurance. Expect 1 day of work per site.

Tooling

We provide coding agent skills that automate each phase of this pipeline. Clone the repo and use a coding agent (e.g., Claude Code or Codex) with the built-in skills:

shell
git clone https://github.com/aiming-lab/WebHarbor.git && cd WebHarbor
Phase     Skill           What it does
Phase 1   clone-website   Scrape, harvest assets, build Flask backend, replicate frontend
Phase 2   design-tasks    Generate 15-20 benchmark tasks covering full site functionality
Phase 3   evolve-env      Evolve the mirror to support each task, detect agent pitfalls
Phase 4   harden-env      De-leak answers, add distractors, broaden catalog, cross-field consistency
Phase 5   seed-database   Finalize DB seeds, scored search, persistence, test user accounts

⚡ Fastest path: one-shot prompt to a coding agent

Once you've cloned the repo (so the .claude/skills/ directory is available), paste the prompt below into your favorite coding agent with your target website filled in. The agent will run Phases 1–5 end-to-end and stop before submitting the PR.

Tip: for best results, we recommend using GPT-5.5 or Claude Opus 4.6 level models with high reasoning effort.

prompt for coding agent
I want to contribute a new website mirror to WebHarbor.

Target site: <REAL_URL>     # e.g. https://www.target-site.com/
Site slug:   <SLUG>         # e.g. target_site (lowercase, snake_case)

Follow the WebHarbor contribution pipeline end-to-end using the local skills under .claude/skills/. Specifically:

Phase 1 — Use the `clone-website` skill:
  - Run ./scripts/new_site.py <SLUG> to scaffold sites/<SLUG>/
  - Register the site in websyn_start.sh, control_server.py, Dockerfile
  - Scrape structure, harvest real assets (no placeholders), build the Flask + SQLAlchemy app
  - Replicate the frontend with Jinja2 templates matching the original site
  - Seed an initial idempotent DB (seed_database + seed_benchmark_users with alice.j@test.com et al.)

Phase 2 — Use the `design-tasks` skill:
  - Write 15-20 benchmark tasks to sites/<SLUG>/tasks.jsonl
  - Cover the site's full functional breadth (search, browse, cart, checkout, account, etc.)
  - Include 3-5 hard tasks that require multi-step reasoning
  - Use the WebVoyager schema: {web_name, id, ques, web, upstream_url}

Phase 3 — Use the `evolve-env` skill:
  - Manually walk through each task; extend the mirror to support it
  - Detect and fix task info leaks, superficial completion, insufficient distractors

Phase 4 — Use the `harden-env` skill:
  - Audit every task against the 4 hardening dimensions (de-leak / distractors / catalog breadth / cross-field consistency)
  - Check the 13 known leak archetypes
  - Re-verify byte-identical reset

Phase 5 — Use the `seed-database` skill:
  - Confirm all seed_*() functions are idempotent at the function level
  - Stabilize instance_seed/<SLUG>.db (boot-and-freeze cycle until md5 matches)
  - Implement scored token-overlap search if not already in place

Verification (after each phase and at the end):
  ./scripts/build.sh webharbor:dev
  docker run -d --rm --name wh-test -p 8201:8101 -p 41000-41014:40000-40014 webharbor:dev
  curl -X POST http://localhost:8201/reset/<SLUG>
  docker exec wh-test md5sum /opt/WebSyn/<SLUG>/instance/<SLUG>.db /opt/WebSyn/<SLUG>/instance_seed/<SLUG>.db
  # the two md5s MUST match

Stop before opening the PR. Print a summary of:
  - Files added / modified
  - Number of seeded rows per major model
  - Tasks count in tasks.jsonl
  - Byte-identical reset confirmation
  - Anything that needs human review or fixing
  - Detailed steps for submitting the PR (HuggingFace assets PR + GitHub PR + .assets-revision bump)


DO NOT STOP UNLESS YOU FINISH ALL THE STEPS. THE WHOLE TASK CAN BE HOURS OF WORK, SO BE PATIENT AND PERSISTENT. IF YOU ENCOUNTER AN ERROR, FIX IT AND KEEP GOING.

I will review your output and then drive the PR submission myself (HuggingFace assets PR + GitHub PR + .assets-revision bump).

Step-by-step manual reference below ↓

Want to understand each phase yourself, or run them one at a time? The full step-by-step walkthrough follows. Useful both for first-time contributors and for debugging when the one-shot prompt above gets stuck.

Phase 0: Claim your website

Before you start

Browse the website tracking sheet and find an unclaimed site. Submit the contribution form so we can lock it for you. You should hear from us within 48 hours.

Phase 1: Fork, scaffold, and clone

WebHarbor lives across two repositories: code on GitHub and heavy assets (seed DBs, images) on the HuggingFace dataset ChilleD/WebHarbor. Fork both, then scaffold a new site under sites/<your_site>/:

shell
git clone https://github.com/<you>/WebHarbor && cd WebHarbor
./scripts/fetch_assets.sh           # pull current HF assets (~2.8 GB)
./scripts/new_site.py mysite        # scaffold sites/mysite/

The scaffold creates the standard skeleton:

layout
sites/mysite/
├── app.py              ← routes + SQLAlchemy models
├── seed_data.py        ← build-time seed (must be idempotent)
├── _health.py          ← end-to-end health check
├── templates/          ← Jinja2 templates
├── static/{css,js,icons}/        ← small UI, in git
├── static/images/                ← heavy, in HF dataset
├── instance_seed/<site>.db       ← seed DB, in HF dataset
└── tasks.jsonl                   ← benchmark tasks

Register the new site in three places (must stay in sync): websyn_start.sh (the SITES=( ... ) array), control_server.py (the SITES = [ ... ] list), and Dockerfile (EXPOSE 8101 40000-N).
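
For the Python side, the change is a one-line addition (illustrative; check the actual shape of the list in control_server.py):

python
# control_server.py -- keep in lockstep with websyn_start.sh and the Dockerfile
SITES = [
    # ... existing sites ...
    "mysite",  # new site, served on the next free 400xx port
]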

Scrape structure

Map the live site's page hierarchy, navigation, and URL patterns. Output goes into scraped_data/ (gitignored).

Harvest real assets

Download product images, article photos, logos. Place under static/images/ (HF-managed). No placeholders.

Build backend

Edit app.py: SQLAlchemy models, routes, idempotent seed_database(), auth, CRUD.
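
As a starting point, app.py usually takes a shape like the sketch below. This is a minimal illustration, not the repo's actual code; the Product model and routes are stand-ins for whatever the target site needs:

python
from flask import Flask, render_template
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///mysite.db"  # resolves under instance/
db = SQLAlchemy(app)

class Product(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(200), nullable=False)
    price = db.Column(db.Float, nullable=False)
    description = db.Column(db.Text)

@app.route("/")
def home():
    # browse page: show a slice of the catalog, never the whole DB on one page
    products = Product.query.limit(24).all()
    return render_template("home.html", products=products)

@app.route("/product/<int:product_id>")
def product_detail(product_id):
    # detail page: where de-leaked spec values live (see Phase 4)
    product = Product.query.get_or_404(product_id)
    return render_template("product.html", product=product)

with app.app_context():
    db.create_all()  # tables only; row seeding belongs in seed_database()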

Replicate frontend

Jinja2 templates that match the original site's layout, typography, and responsive behavior.

Phase 2: Design tasks (15–20 per site)

Tasks define what the environment must support. Write them to sites/<site>/tasks.jsonl using the WebVoyager schema (one JSON object per line):

tasks.jsonl
{"web_name": "Mysite", "id": "Mysite--0", "ques": "Search for ...", "web": "http://localhost:40015/", "upstream_url": "https://www.mysite.com/"}
{"web_name": "Mysite", "id": "Mysite--1", "ques": "Filter products under $30 with 4+ stars ...", "web": "http://localhost:40015/", "upstream_url": "https://www.mysite.com/"}

You can adapt tasks from existing benchmarks (WebVoyager, Online-Mind2Web) or synthesize new ones with an LLM — LLMs have strong knowledge of popular websites and make effective task generators.
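
Because the harness consumes tasks.jsonl verbatim, it pays to sanity-check the schema before moving on. A small hypothetical checker (not part of the repo):

python
import json

REQUIRED_KEYS = {"web_name", "id", "ques", "web", "upstream_url"}

def check_tasks(path="sites/mysite/tasks.jsonl"):
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            task = json.loads(line)              # every line must be valid JSON
            missing = REQUIRED_KEYS - task.keys()
            assert not missing, f"line {lineno}: missing keys {missing}"
    print("tasks.jsonl OK")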

Coverage principle

Tasks must cover the site's full functional breadth, not just one feature. For example, an Amazon mirror should include tasks across: searching products, filtering by category, reading reviews, adding to cart, checkout flow, order history, account settings, address management, and payment methods.

Difficulty principle

Include tasks that current frontier models cannot easily solve: multi-step workflows, disambiguation scenarios (user has 3 payment cards, which one?), cross-page reasoning, and tasks requiring visual understanding of product images or map layouts.

Phase 3: Task-driven environment evolution

Feed the tasks to a coding agent and let it evolve the environment to support each task. The agent will add routes, templates, database seeds, and form handlers on demand.

Watch out for three recurring pitfalls:

⚠️ Task info leak

Coding agents frequently make tasks trivially easy. A "find product X" task becomes solvable without searching because the product is the only item displayed, or appears in the page title. The answer should never be visible without navigating, searching, and reading. We believe this is related to reward hacking in agent training.

⚠️ Superficial completion

Agents often produce pages that pass automated checks but fail under real interaction: placeholder text, broken forms, missing images, search that only returns exact matches, or checkout flows that skip validation steps.

⚠️ Insufficient distractors

If the task asks "buy an iPhone 17", the database must not contain only one phone. Seed diverse distractor items: multiple phone models, brands, and price ranges. The agent must compare and select, not just click the only option.
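
Concretely, a distractor-aware seed might look like this sketch (made-up products and prices, assuming a Product model with title, brand, and price columns):

python
PHONES = [
    ("iPhone 17",     "Apple",    999.0),  # the task target
    ("iPhone 17 Pro", "Apple",   1199.0),  # near-miss: right family, wrong model
    ("Galaxy S25",    "Samsung",  899.0),  # competing brand
    ("Pixel 10",      "Google",   799.0),
    ("Moto G Power",  "Motorola", 249.0),  # different price tier
]

def seed_phones():
    if Product.query.filter_by(title="iPhone 17").first():
        return  # whole-function idempotency gate (see Phase 5)
    for title, brand, price in PHONES:
        db.session.add(Product(title=title, brand=brand, price=price))
    db.session.commit()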

Phase 4: Hardening (critical)

This is where human review is indispensable. Systematically check each task against the 4 hardening dimensions:

A. De-leak answers

Task constraint values must NOT appear in card titles, search result snippets, or page headings. Push answer details to spec tables, description prose, or detail-page sections that require click-through.

B. Add near-miss distractors

For each task, ensure the search results contain items that match the query category but fail ONE constraint. Target ≤50% full-match density so agents must read specs.
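
One way to keep yourself honest is to measure the density directly. A hypothetical audit helper, where satisfies encodes the task's constraints:

python
def full_match_density(results, satisfies):
    """Fraction of search results that satisfy ALL task constraints."""
    if not results:
        return 0.0
    return sum(1 for r in results if satisfies(r)) / len(results)

# e.g. for "wireless mouse under $25 with 4+ stars":
#   full_match_density(search("wireless mouse"),
#                      lambda p: p.price < 25 and p.rating >= 4)
# should come out <= 0.5, so agents must read specs rather than click blindly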

C. Broaden the catalog

Search queries should return ≥6 results from multiple sub-categories. If a zip code search returns only pizza shops, add coffee shops, gyms, banks, and pharmacies in the same area.

D. Cross-field consistency

When modifying any product/item field, regenerate ALL related fields (specs, description, features, tags) from the same source of truth to prevent contradictions.
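
In practice that means deriving every displayed field from one canonical record, along the lines of this sketch (field names are illustrative):

python
def apply_specs(product, specs):
    # `specs` is the single source of truth; every other field is derived
    product.specs = specs
    product.title = f"{specs['brand']} {specs['model']}"
    product.description = (
        f"The {specs['brand']} {specs['model']} has a {specs['screen_in']}-inch "
        f"display and {specs['storage_gb']} GB of storage."
    )
    product.tags = ",".join([specs["brand"].lower(), specs["category"]])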

Why humans are essential

Automated tests verify that the environment supports a task, but only a human can verify it's challenging. Can the answer be found without scrolling? Does the search show only the target item? Is the layout visually faithful to the real site? These are judgment calls that require human eyes.

Phase 5: Stabilize the seed DB & ship to HuggingFace

Idempotent seeding (the byte-identical reset invariant)

Every seed_*() function in app.py / seed_data.py MUST early-return when the DB is already populated. Per-row gates are NOT enough — even a no-op db.session.commit() bumps SQLite metadata and breaks /reset/<site> byte-identity. Gate every seed function as a whole:

python
def seed_database():
    if Product.query.count() > 0:
        return                # gate the whole function
    # ... seed rows ...

def seed_benchmark_users():
    if User.query.filter_by(email='alice.j@test.com').first():
        return
    # ... seed 4 users ...

Realistic volume

Seed 50–200 items per major entity. Real sites have thousands; a small but diverse catalog preserves the browsing experience.

Scored search

Token-overlap scoring, NOT strict AND. Multi-word queries like "Boston Celtic players" fail strict matching. Count matching tokens, filter score > 0.
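
A minimal token-overlap scorer might look like this (assuming a Product model with title and description columns; the exact ranking is up to you):

python
def search(query):
    tokens = set(query.lower().split())
    scored = []
    for p in Product.query.all():
        haystack = f"{p.title} {p.description or ''}".lower()
        score = sum(1 for t in tokens if t in haystack)
        if score > 0:  # any overlap qualifies -- NOT strict AND
            scored.append((score, p))
    scored.sort(key=lambda pair: -pair[0])
    return [p for _, p in scored]

With this, "Boston Celtic players" still surfaces items matching only "boston" and "celtic", ranked below fuller matches.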

Test user accounts

Seed 4 benchmark users (alice.j@test.com etc., password TestPass123!) with pre-existing carts, bookmarks, orders, and profiles for auth-gated tasks.

Runtime data in DB, not JSON

HTTP handlers read from SQLAlchemy, not JSON files. Fold scrape JSON into instance_seed/<site>.db at build time via seed_data.py.
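
The fold itself is a one-time build step in seed_data.py, along these lines (the scrape file name and fields are hypothetical):

python
import json

def seed_from_scrape(path="scraped_data/products.json"):
    if Product.query.count() > 0:
        return  # whole-function gate keeps /reset byte-identical
    with open(path) as f:
        items = json.load(f)
    for item in items:
        db.session.add(Product(
            title=item["title"],
            price=float(item["price"]),
            description=item.get("description", ""),
        ))
    db.session.commit()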

Two-repo workflow

Heavy assets (instance_seed/*.db, static/images/) live on the HuggingFace dataset ChilleD/WebHarbor, not directly in git. .assets-revision pins the exact HF commit. After your DB / images are stable, run:

shell
./scripts/extract_assets.sh ../wh-static-pr/
cd ../wh-static-pr
hf upload-large-folder <your-fork>/WebHarbor . --repo-type dataset
# open PR on https://huggingface.co/datasets/ChilleD/WebHarbor
# after merge, bump .assets-revision in the code repo

Phase 6: Pre-PR checks & submission

Run all of these before opening the GitHub PR:

shell
# 1. syntax
python3 -m py_compile sites/<site>/app.py

# 2. build
./scripts/build.sh webharbor:dev

# 3. run on alt ports
docker run -d --rm --name wh-test \
  -p 8201:8101 -p 41000-41014:40000-40014 webharbor:dev

# 4. all 15 sites return 200
for p in $(seq 41000 41014); do
  curl -so /dev/null -w "$p:%{http_code}\n" http://localhost:$p/
done

# 5. byte-identical reset invariant
curl -X POST http://localhost:8201/reset/<site>
docker exec wh-test md5sum \
  /opt/WebSyn/<site>/instance/<site>.db \
  /opt/WebSyn/<site>/instance_seed/<site>.db
# the two md5s MUST match

docker stop wh-test

Your GitHub PR description should include:

  1. The real site mirrored + URL
  2. Number of seeded rows per major model
  3. Link to the paired HuggingFace PR (asset side)
  4. Output of POST /reset/<site> showing ready: true
  5. Screenshot evidence of visual fidelity vs. the real site
  6. 15–20 tasks in sites/<site>/tasks.jsonl