Skyvern Benchmark & Review

Architecture Breakdown

Skyvern employs a hybrid architecture combining a headless Chromium browser (via Playwright) with a vision-language model (VLM) backend. The agent captures full-page screenshots and raw DOM snapshots at each step, feeding them into an LLM (e.g., GPT-4 or Claude) to decide the next action. The DOM is parsed using a custom tree-sitter grammar that extracts interactive elements (buttons, inputs, links) and their XPath/CSS selectors, enabling precise element targeting even within shadow DOM or iframes.

Anti-detection is handled through stealth plugins: playwright-stealth patches WebDriver flags, and undetected-chromedriver randomizes browser fingerprints. Skyvern also rotates user agents and viewport sizes per session.

Benchmarks & Telemetry

In internal benchmarks on AllAIBrowser.com’s standard task suite (100 tasks across e-commerce, form filling, and CAPTCHA scenarios):

DOM Parsing Speed: 2.1 seconds average to extract and serialize interactive elements from a typical page.
CAPTCHA Bypass Rate: 95% success on reCAPTCHA v2/v3 and hCaptcha, using a combination of screenshot-based reasoning and automated clicking.
Success Rate: 92% overall task completion (e.g., checkout flows, login forms). Failures often occur on sites with heavy JavaScript rendering delays or complex multi-step CAPTCHAs.
Token Cost Per Task: 15k tokens average (input + output), translating to ~$0.30 per task with GPT-4. Using Claude 3 Haiku reduces cost to ~$0.08.

Developer Experience

Skyvern offers both a Python SDK and a REST API. The SDK is well-documented with examples for common patterns (e.g., await agent.run("Add item to cart and checkout")). The agent supports custom prompts and action schemas via JSON. Logging is verbose, outputting each step’s reasoning, screenshot, and DOM state. However, debugging can be challenging due to the black-box nature of LLM decisions. The project is actively maintained on GitHub with weekly releases and a responsive community.

Conclusion

Skyvern is a powerful tool for automating complex web interactions, especially where traditional selectors fail. Its vision-based approach and anti-detection features make it suitable for scraping and testing dynamic sites. However, the token cost and latency may be prohibitive for high-volume tasks. For developers needing a robust, open-source alternative to commercial agents, Skyvern is a top contender.