WebVisionKit is a classroom-oriented framework for students who need to treat a browser as a vision-and-control problem rather than a Document Object Model (DOM) automation problem, in which code works directly with the page's HTML structure.
In WebVisionKit, an app is the student-written program that receives screenshots of the rendered web page, decides what to do next, and uses the runtime action API to send browser inputs back to Chrome.
The supported runtime model is:
- Google Chrome runs on the host with DevTools remote debugging enabled.
- ./launch.bash starts or reuses Chrome, validates Docker connectivity, and launches the selected app.
- The webvisionkit runtime package and the student app run inside Docker.
- The container receives screenshots of the rendered website through the Chrome DevTools Protocol (CDP), which is Chrome's debugging and control API.
- The app sends actions through the WebVisionKit runtime API, which forwards them back to Chrome through the same CDP connection.
That architecture is deliberate. Students see pixels, reason about state, and act through browser input primitives. They do not depend on page selectors or site-specific DOM hooks tied to a site's internal HTML structure.
This root README is the repo overview; the student-first documentation set is the place to start for coursework.
WebVisionKit is intended for:
- Introductory OpenCV exercises where students need real image input.
- Tree and graph algorithm assignments that play small web games.
- Controlled browser interaction labs where students map detections to clicks, drags, scrolls, and key presses.
- Classroom environments where the browser should stay on the host, but student code should stay containerized.
The bundled games/input-lab/ fixture is the primary calibration environment. It exposes pointer, click, drag, scroll, text, and keyboard tasks with CV-friendly markers before students move on to games or arbitrary external sites.
WebVisionKit supports:
- an image stream from a Chrome page target
- frame-by-frame Python callbacks inside Docker
- browser input primitives such as click, drag, scroll, and keyboard input
- local bundled games and arbitrary URLs
WebVisionKit does not promise:
- compatibility with login flows, anti-bot protections, or sites that actively resist automation
- DOM-level page understanding
- production-grade browser automation beyond the image-stream-plus-input model
On macOS, the host needs:
- Docker Desktop
- Google Chrome installed in /Applications/Google Chrome.app, or CHROME_APP set to another Chrome app bundle
- curl
- either sha256sum or shasum
On native Linux, the host needs:
- Docker Engine or Docker Desktop
- Google Chrome or Chromium on PATH, or CHROME_APP set explicitly
- curl
- either sha256sum or shasum
On Windows, the host needs:
- WSL2
- Docker Desktop with WSL integration enabled
- Google Chrome installed on Windows
- powershell.exe
- wslpath
- curl
- either sha256sum or shasum
WSL1 is not supported.
Build the runtime image once:
./infrastructure/docker/build.bash
Run the environment check:
./launch.bash doctor
Start the full launcher:
./launch.bash
For a bounded smoke run instead of an interactive session:
./launch.bash smoke
The launcher accepts these subcommands:
- ./launch.bash or ./launch.bash up: Full flow: discover apps, resolve the target, ensure Chrome, and run Docker.
- ./launch.bash chrome: Start or reuse Chrome with remote debugging enabled.
- ./launch.bash doctor: Non-interactive prerequisite and connectivity check. It validates Docker access, Chrome discovery, curl, SHA-256 tool availability, WSL requirements, the host DevTools endpoint, and container-to-host DevTools reachability.
- ./launch.bash smoke: Non-interactive bounded validation. By default it runs:
  - frame_report against game://input-lab
  - frame_report against https://example.com
  - interaction_showcase against game://input-lab with ACTION_MODE=dry-run
- ./launch.bash container: Lower-level direct container run using the current environment variables.
Student projects live under:
apps/<name>/
Each app folder must contain:
apps/<name>/app.py
Each app.py must export:
app = BrowserApp(...)
The folder may also contain helper modules and assets. For classroom work, the supported pattern is a small self-contained project folder rather than a single giant script.
Most algorithmic game-playing apps are expected to be authored by students in their own app folders; the public WebVisionKit repo only ships small reference examples such as apps/simple_drag/.
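The layout rule above (each apps/<name>/ folder must contain app.py) can be checked mechanically. The sketch below is not the launcher's actual discovery code: discover_apps is a hypothetical helper, and the real launcher may apply additional rules.

```python
from pathlib import Path
import tempfile


def discover_apps(apps_root):
    """Return sorted names of sub-folders of apps_root that contain app.py.

    Hypothetical helper for illustration only; WebVisionKit's real
    discovery logic may differ.
    """
    root = Path(apps_root)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "app.py").is_file()
    )


# Build a throwaway apps/ tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    apps = Path(tmp) / "apps"
    (apps / "simple_drag").mkdir(parents=True)
    (apps / "simple_drag" / "app.py").write_text("app = None\n")
    (apps / "notes_only").mkdir()  # no app.py, so it is skipped
    found = discover_apps(apps)

print(found)
```

A folder without app.py is ignored, which matches the requirement that every app folder ship its own entry point.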
The beginner-facing API remains intentionally small:
from webvisionkit import BrowserApp


def on_frame(image, context):
    # Called once per delivered frame; an empty dict adds no metadata.
    return {}


# The launcher looks for this module-level name in app.py.
app = BrowserApp(
    start_target="about:blank",
    fps=1.0,
    on_frame=on_frame,
)
- start_target: default launch target for the app
- fps: callback rate
- on_frame(image, context): called for each delivered frame
Supported target formats:
- https://example.com
- about:blank
- game://input-lab
- game://simple_drag
- other bundled local game tokens such as game://tic-tac-toe
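To make the three target shapes concrete, here is a minimal sketch that sorts a target string into a bundled game token, about:blank, or an ordinary URL. classify_target is a hypothetical name, and this is not the launcher's real resolution logic, only one way to tell the formats apart using the standard library.

```python
from urllib.parse import urlsplit


def classify_target(target):
    """Classify a target string as a game token, about:blank, or a web URL.

    Hypothetical helper for illustration; not WebVisionKit's resolver.
    """
    parts = urlsplit(target)
    if parts.scheme == "game" and parts.netloc:
        # game://input-lab -> the token after the scheme names the game
        return ("game", parts.netloc)
    if target == "about:blank":
        return ("blank", None)
    if parts.scheme in ("http", "https") and parts.netloc:
        return ("url", target)
    raise ValueError(f"unsupported target: {target!r}")


for example in ("game://input-lab", "about:blank", "https://example.com"):
    print(example, "->", classify_target(example))
```

Anything that is not one of the listed shapes raises ValueError in this sketch, so a typo in a target is reported early instead of being passed on to Chrome.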
Inside on_frame(image, context), the context object exposes:
- context.state
- context.browser
- context.stream
- context.frame_index
- context.session_index
- context.url
- context.frame_width
- context.frame_height
- context.save_dir
- context.captured_at
- context.recent_action_results
The callback may return None or a dict. Returned fields are merged into the per-frame metadata record.
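As an illustration of that return contract, the sketch below keeps a running frame count and returns it as metadata. It assumes context.state behaves like a dict that persists between frames, which the text above does not confirm; the SimpleNamespace object is only a stand-in so the callback can be exercised without the runtime.

```python
from types import SimpleNamespace


def on_frame(image, context):
    # Assumption: context.state is dict-like and survives across frames.
    seen = context.state.get("frames_seen", 0) + 1
    context.state["frames_seen"] = seen
    # Returned fields would be merged into the per-frame metadata record.
    return {
        "frames_seen": seen,
        "frame_index": context.frame_index,
        "url": context.url,
    }


# Stand-in context for local experimentation (not the real runtime object).
fake_context = SimpleNamespace(state={}, frame_index=0, url="about:blank")
first = on_frame(None, fake_context)
fake_context.frame_index = 1
second = on_frame(None, fake_context)
print(first, second)
```

In a real app the same function would be passed as on_frame to BrowserApp, as in the minimal example earlier in this README.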
context.browser exposes site-agnostic input primitives:
- open(url)
- move(x, y)
- mouse_down(x, y)
- mouse_up(x, y)
- click(x, y)
- double_click(x, y)
- drag(x1, y1, x2, y2)
- scroll(x, y, delta_x=0, delta_y=...)
- key_down(key)
- key_up(key)
- key_press(key)
- type_text(text)
- pause(duration_ms)
Coordinates are always in the same image pixel space the callback receives.
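Because coordinates share the frame's pixel space, a detection's bounding box can be turned into an action target directly. The sketch below shows that mapping for a drag. The bounding boxes are hard-coded, hypothetical detections, and RecordingBrowser is a stand-in that only records calls; only the drag(x1, y1, x2, y2) signature comes from the list above.

```python
from types import SimpleNamespace


def box_center(x, y, width, height):
    """Center of an axis-aligned box in image pixels (integer division)."""
    return (x + width // 2, y + height // 2)


class RecordingBrowser:
    """Stand-in for context.browser that records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def drag(self, x1, y1, x2, y2):
        self.calls.append(("drag", x1, y1, x2, y2))


def on_frame(image, context):
    # Hypothetical detections as (x, y, width, height); a real app would
    # compute these from `image`, for example with color thresholding.
    block = (40, 60, 20, 20)
    goal = (200, 60, 40, 40)
    bx, by = box_center(*block)
    gx, gy = box_center(*goal)
    # No coordinate conversion is needed: same pixel space as the frame.
    context.browser.drag(bx, by, gx, gy)
    return {"dragged_from": [bx, by], "dragged_to": [gx, gy]}


context = SimpleNamespace(browser=RecordingBrowser())
result = on_frame(None, context)
print(context.browser.calls)
```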
The repo ships these reference apps:
- apps/screenshot_capture/: Periodic screenshot writer.
- apps/frame_report/: Minimal observation app for smoke runs and debugging.
- apps/interaction_showcase/: Framework browser-input demo for exercising the input-lab fixture and validating action execution paths.
- apps/simple_drag/: Minimal color-detection example that drags a red block into a green goal with one high-level action call.
Bundled local targets live under games/:
- game://input-lab
- game://simple_drag
- game://tic-tac-toe
- game://connect-4
- game://snake
- game://memory-match
- game://2048
game://input-lab is the recommended first assignment target.
game://simple_drag is the smallest bundled drag example for learning the app API end to end.
Example launch command:
APP_NAME=simple_drag TARGET_URL_OVERRIDE=game://simple_drag ./launch.bash
Launcher environment variables:
- IMAGE_NAME
- APP_NAME
- TARGET_URL_OVERRIDE
- OUTPUT_DIR
- SCREENSHOT_DIR
- FORCE_REBUILD=1
- SMOKE_EXTERNAL_URL
OUTPUT_DIR defaults to the repo-relative ./output folder, not the caller’s current shell directory.
Chrome environment variables:
- CHROME_APP
- CHROME_PORT
- CHROME_PROFILE_DIR
- CHROME_REMOTE_ALLOW_ORIGINS
On WSL, the default Chrome profile directory is a Windows-native path under LocalApplicationData\WebVisionKit\chrome-cdp-profile.
If Windows Chrome launch is unavailable, the launcher falls back to Linux Chrome in WSL and uses /tmp/webvisionkit-chrome-cdp-profile by default.
Runtime environment variables:
- RECONNECT_ATTEMPTS
- RECONNECT_DELAY_SECONDS
- RECEIVE_TIMEOUT_SECONDS
- IDLE_TIMEOUT_SECONDS
- LOG_INTERVAL_SECONDS
- MAX_FRAMES
- LIVE_PREVIEW
- VIDEO_OUTPUT
- METADATA_OUTPUT
- PROCESSORS
Action environment variables:
- ACTION_MODE=auto|dry-run|off
- ACTION_DEFAULT_COOLDOWN_MS
- ACTION_MAX_PER_FRAME
- ACTION_DRAG_STEP_COUNT
- ACTION_DRAG_STEP_DELAY_MS
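ACTION_MODE accepts exactly three values, so a wrapper script can validate it before launching. The following is a minimal sketch of such a check, not the runtime's own config parser; read_action_mode is a hypothetical name, and returning None for an unset variable is a choice made here to avoid guessing the framework's default.

```python
import os

# The three values documented for ACTION_MODE.
VALID_ACTION_MODES = ("auto", "dry-run", "off")


def read_action_mode(env=None):
    """Return the validated ACTION_MODE, or None when it is unset.

    Hypothetical helper; the real runtime may parse this differently.
    """
    env = os.environ if env is None else env
    raw = env.get("ACTION_MODE")
    if raw is None or raw == "":
        return None  # leave the default to the framework
    if raw not in VALID_ACTION_MODES:
        allowed = ", ".join(VALID_ACTION_MODES)
        raise ValueError(f"ACTION_MODE must be one of {allowed}; got {raw!r}")
    return raw


print(read_action_mode({"ACTION_MODE": "dry-run"}))
```

Passing a plain dict instead of os.environ keeps the check easy to test without changing the process environment.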
Run non-mutating repository checks:
./scripts/check-env.sh
Run the unit-test suite:
./scripts/test.sh
The test suite covers:
- launcher target resolution
- websocket host rewriting
- config parsing
- action validation and cooldown handling
- app discovery and helper-module imports inside apps/<name>/
Host prerequisites:
- Start Docker Desktop or the Docker daemon.
- Confirm docker info works for your user.
- Confirm curl exists on the host.
- Install either sha256sum or shasum.
Chrome discovery:
- macOS: set CHROME_APP if Chrome is not in /Applications/Google Chrome.app
- Linux: set CHROME_APP to the Chrome or Chromium executable
- WSL: set CHROME_APP to the Windows chrome.exe path if Chrome is not in the default location
Container-to-host networking:
- macOS and Docker Desktop: the default hostname should work
- Native Linux: WebVisionKit adds --add-host=host.docker.internal:host-gateway automatically
- WSL2: use Docker Desktop with WSL integration enabled
WSL checks:
- Confirm you are on WSL2, not WSL1
- Confirm Windows interop works by running a simple powershell.exe -NoProfile -Command "Write-Output ok" from WSL
- If Windows interop fails, install Linux Chrome/Chromium in WSL so launcher fallback can be used
- If Linux Chrome is missing in WSL fallback mode, the launcher auto-installs Google Chrome using: wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
- Check Windows Chrome startup manually with the same remote debugging port when using the Windows backend
- Re-run ./launch.bash doctor to verify the host endpoint and the container probe separately
WSL Linux-fallback routing:
- The launcher tries container reachability in this order when Linux fallback mode is used: WSL gateway first, then host.docker.internal
- It automatically selects the first reachable candidate for CHROME_HOST_IN_CONTAINER
- If your setup uses a different route, set CHROME_HOST_IN_CONTAINER explicitly before launch
- Re-run ./launch.bash doctor and check the logged "Effective CHROME_HOST_IN_CONTAINER"
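The "first reachable candidate" rule described above can be sketched as an ordered probe. This is an illustration of the selection idea, not the launcher's implementation: first_reachable and tcp_probe are hypothetical names, the gateway address is made up, and port 9222 is only an assumed example value for CHROME_PORT.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(candidates, port, probe=tcp_probe):
    """Return the first candidate host the probe accepts, or None."""
    for host in candidates:
        if probe(host, port):
            return host
    return None


# Demo with a fake probe so no network access is needed.
# "172.20.0.1" is a made-up WSL gateway address; 9222 is an assumed port.
def fake_probe(host, port):
    return host == "host.docker.internal"


chosen = first_reachable(["172.20.0.1", "host.docker.internal"], 9222, fake_probe)
print(chosen)
```

Injecting the probe function keeps the ordering logic testable; swapping in tcp_probe performs a real connection attempt against each candidate in turn.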
On native Linux and WSL, WebVisionKit runs the container with the calling user’s UID and GID by default so output files stay user-owned. If you override container args manually, avoid removing that behavior unless you want root-owned artifacts.
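If you do assemble container arguments by hand, the sketch below shows one way to preserve user-owned output. It assumes the mapping is expressed with Docker's --user UID:GID option, which this README does not confirm as the launcher's mechanism; user_owned_run_args is a hypothetical helper, and the --add-host value is the one quoted in the networking notes above.

```python
import os


def user_owned_run_args(uid=None, gid=None, add_host_gateway=False):
    """Build extra `docker run` arguments that keep output files user-owned.

    Hypothetical helper; assumes Docker's --user flag is the mechanism.
    """
    # os.getuid/os.getgid exist on Linux and WSL, the platforms in question.
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    args = ["--user", f"{uid}:{gid}"]
    if add_host_gateway:
        # Native Linux needs an explicit route to the host.
        args.append("--add-host=host.docker.internal:host-gateway")
    return args


example_args = user_owned_run_args(uid=1000, gid=1000, add_host_gateway=True)
print(example_args)
```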