WebHarbor: Docking Real Websites for Evolving GUI Agent Environments

Live websites are often noisy, blocked by reCAPTCHA, and hide their most useful features behind login walls. WebHarbor docks them into local Docker mirrors that are stable, reproducible, and resettable in milliseconds, and it evolves each environment with more functionality and data as agents grow stronger.

  • Stable & reproducible: no network noise, no content drift, no geo-blocks
  • Deep features unlocked: carts, checkouts, and accounts, all testable
  • Evolving: harder tasks drive richer mirrors, and the environment grows with them
  • RL-ready: sub-second database resets for lightweight environments
  • Community-driven: 15 sites today; let's scale to 100 and beyond, together

Start deploying 15 WebVoyager sites locally with one command:

shell
docker run -p 40000-40014:40000-40014 battalion7244/webharbor:latest

02 · Motivation

The web is multimodal, deep, and gated. Current benchmarks and environments are none of these.

Real websites have rich visual layouts, product images, interactive maps, and deeply nested functionality behind login walls. Existing web agent benchmarks either fight the live web’s instability or retreat into toy environments that don’t resemble it.

a · Live-site eval (WebVoyager, Online-Mind2Web)

reCAPTCHA, geo-blocks, network flakiness, and content drift inflate eval noise. Multiple WebVoyager sites are effectively inaccessible. Tasks never require login, so coverage is limited to surface-level fact lookups.

b · Offline traces (Mind2Web)

No interactive environment, no state, no consequence of a wrong click. Useful for supervised pre-training but fundamentally unsuited to evaluating agents that must act, not just predict.

c · Synthetic environments (WebArena)

Some synthetic browser environments are built from coloured rectangles instead of real images, and offer only a handful of sites with no multimodal richness. WebArena is content-rich but heavyweight, and its slow resets make RL-scale rollouts impractical.

Core insight

The bottleneck isn’t the agent. It’s the environment. We need web environments that are stable (no eval noise), multimodal (real images and layouts), deep (auth-gated flows), and evolving (growing with agent capability).

How WebHarbor compares

|                       | WebArena                 | WebVoyager                                    | Online-Mind2Web               | Webharbor                       |
|-----------------------|--------------------------|-----------------------------------------------|-------------------------------|---------------------------------|
| Environment           | Self-hosted (synthetic)  | Live web                                      | Live web                      | Self-hosted (synthetic)         |
| # of Sites            | 5                        | 15                                            | 100+                          | 15 (100+ planned)               |
| Eval Stability        | Stable                   | Noisy (reCAPTCHA, geo-blocks, tasks outdated) | Noisy (reCAPTCHA, geo-blocks) | Stable                          |
| Login & Deep Features | Fully unlocked           | Not supported                                 | Not supported                 | Fully unlocked                  |
| Reset Speed           | Slow (Docker restart)    | N/A (live)                                    | N/A (live)                    | Fast (SQLite reset)             |
| Evolution             | Static, frozen           | Tasks outdated                                | Tasks updated                 | Evolving (env grows with tasks) |
| RL Compatible         | Impractical (slow reset) | No (noisy, live)                              | No (noisy, live)              | Yes (fast reset, deterministic) |
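
For intuition on the "Fast (SQLite reset)" row: restoring a single database file from a pristine snapshot is what makes sub-second, RL-scale resets plausible. Here is a minimal sketch of that pattern; the file paths are hypothetical, and WebHarbor's actual reset mechanism may differ.

python
import shutil
import time

# Hypothetical paths: a pristine snapshot taken at build time and the
# live database file the mirror's backend reads from.
SNAPSHOT_DB = "snapshots/site.pristine.sqlite3"
LIVE_DB = "data/site.sqlite3"

def reset_environment() -> float:
    """Restore the live DB from the snapshot; return elapsed seconds."""
    start = time.perf_counter()
    # Copying one SQLite file costs time proportional to its size:
    # typically milliseconds for catalog-scale data, with no container
    # restart. (Assumes the backend opens a fresh connection per request.)
    shutil.copyfile(SNAPSHOT_DB, LIVE_DB)
    return time.perf_counter() - start

print(f"reset took {reset_environment() * 1000:.1f} ms")

In an RL loop, this makes an episode reset essentially free compared to restarting a container.
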
03 · Evolving Environments: Our Approach

Coding agents build. Humans verify. Tasks drive evolution.

WebHarbor environments are built by coding agents (e.g. Claude Code with specialized skills), then rigorously reviewed by human annotators to ensure multimodal fidelity and functional depth. Tasks from existing benchmarks (starting with WebVoyager) define what each environment supports and what it looks like.

1 · Coding agent builds

A coding agent with specialized skills scrapes the real site's structure, assets, and catalog, then generates a full-stack mirror — SQLite database, REST backend, and frontend — with auth and CRUD flows.
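
To make "full-stack mirror" concrete, here is a hedged sketch of the kind of SQLite schema a coding agent might generate for a shopping-site mirror; every table and column name is illustrative, not WebHarbor's actual schema.

python
import sqlite3

# Illustrative only: a minimal schema for a shopping-site mirror with
# auth (users) and CRUD flows (products, cart_items). All names are
# hypothetical, not WebHarbor's actual schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id            INTEGER PRIMARY KEY,
    email         TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS products (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    price_cents INTEGER NOT NULL,
    image_path  TEXT
);
CREATE TABLE IF NOT EXISTS cart_items (
    user_id    INTEGER REFERENCES users(id),
    product_id INTEGER REFERENCES products(id),
    quantity   INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, product_id)
);
"""

conn = sqlite3.connect("mirror.sqlite3")
conn.executescript(SCHEMA)
conn.commit()
conn.close()

A REST backend then exposes CRUD routes over tables like these, and the frontend serves the scraped layout and assets on top.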

2 · Human-in-the-loop review

Human reviewers verify every mirror for visual fidelity, functional correctness, and data quality. This step is essential: we observe that agents frequently take shortcuts, produce superficially correct but broken pages, or skip edge cases.

3 · Task-driven scoping

Cloning an entire live website is impractical. Instead, tasks define which features the mirror must support. Every task is verified to run against the environment with ground-truth checks.
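
For example, a task like "add product X to the cart" can be graded by inspecting the mirror's database directly rather than by parsing the final page. A minimal sketch, with hypothetical table and column names:

python
import sqlite3

def check_add_to_cart(db_path: str, user_id: int, product_id: int) -> bool:
    """Ground-truth check: did the agent's actions actually put the
    item in the cart? Table and column names are hypothetical; the
    point is that a self-hosted mirror lets the verifier inspect state
    directly instead of guessing from the rendered page."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT quantity FROM cart_items "
            "WHERE user_id = ? AND product_id = ?",
            (user_id, product_id),
        ).fetchone()
    finally:
        conn.close()
    return row is not None and row[0] >= 1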

4 · Environment evolves

Once agents master the existing tasks, we introduce harder ones that require deeper site features. The coding agent extends the mirror on demand. The environment is never "done"; it grows alongside agent capability.

Why humans are essential

Building high-quality, multimodal web environments cannot yet be fully automated. We observe that coding agents frequently hack their way through: using placeholder images, skipping complex layouts, or producing pages that pass automated checks but fail under real interaction. Human review catches these issues and ensures every mirror matches the look, feel, and depth of the real site.

Why evolving

A live website like Amazon has millions of products, thousands of routes, and decades of accumulated functionality. Cloning it entirely is neither tractable nor necessary. But we can ensure every task has full environment support. When agents master those tasks, we expand: harder tasks → more features, richer data, deeper flows. The environment evolves with the agent.

Our vision

Leverage code to mirror the world. Cultivate the agent. Prepare it to interact with and learn from the real world itself.

04 · Quickstart

One command, 15 sites.

All mirrors ship in a single Docker image. Each site binds to a fixed port offset from 40000; point your agent at http://localhost:40000–40014.

shell
docker run -p 40000-40014:40000-40014 battalion7244/webharbor:latest

Sites are mapped to ports :40000 through :40014 in order: Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking, GitHub, Google Flights, Google Maps, Google Search, Hugging Face, Wolfram Alpha, Cambridge Dictionary, Coursera, and ESPN.
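
To confirm all 15 mirrors came up, you can probe each port with a few lines of Python; this sketch uses only the standard library and assumes each mirror serves a page at its root path.

python
import urllib.request

# Ports follow the fixed offsets listed above: 40000 + index.
SITES = [
    "Allrecipes", "Amazon", "Apple", "ArXiv", "BBC News",
    "Booking", "GitHub", "Google Flights", "Google Maps", "Google Search",
    "Hugging Face", "Wolfram Alpha", "Cambridge Dictionary", "Coursera", "ESPN",
]

for offset, name in enumerate(SITES):
    url = f"http://localhost:{40000 + offset}/"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name:20s} {url} -> HTTP {resp.status}")
    except OSError as exc:
        print(f"{name:20s} {url} -> FAILED ({exc})")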

05 · Contribution

Scale to 100+ sites with community power.

We have built 15 high-quality website environments that fully support the WebVoyager benchmark. Our next goal is to scale to 100+ websites, covering all environments in Online-Mind2Web (147 sites). We are inviting the community to build this together.

Two tracks to join the author list

Track A: Contribute a new website

Use a coding agent to build a new mirror site (frontend + backend + database + tasks). Contributing one website qualifies you for consideration for the final paper's author list, subject to review and quality standards.

  1. Browse the website tracking sheet and find an unclaimed site you're interested in.
  2. Submit the contribution form to claim it. We will lock it to prevent duplicate work.
  3. Once confirmed, follow our Website Contribution Guide to build and submit a PR.

Track B: Review environments

Review submitted mirror sites by checking visual fidelity, functional correctness, and task grounding. Reviewing 5 environments earns a spot on the author list.

  1. Browse open Pull Requests on GitHub.
  2. Review the submitted environment: does it support all proposed tasks? Are the tasks meaningful and challenging?
  3. Follow our recommended Review Pipeline for systematic verification.

Acknowledgement

Any improvement to existing environments (bug fixes, UI polish, data enrichment, task suggestions, or even feedback) qualifies for the paper's acknowledgement section. Every contribution matters.

Citation

BibTeX

bibtex
@misc{webharbor2026,
  title        = {WebHarbor: Docking Real Websites for Evolving GUI Agent Environments},
  author       = {{WebHarbor Team and Contributors}},
  year         = {2026},
  url          = {https://aiming-lab.github.io/webharbor.github.io},
  note         = {Project website.}
}