Show HN: Web-eval-agent – Let the coding agent debug itself

Hey HN! We’ve been building an MCP server to help AI-assisted web app developers by using browser agents to test whether changes made by an AI inside an editor actually work. We’ve been testing it on scenarios like verifying new flows in a UI, or checking that sending a chat request triggers a response. The idea is to let your coding agent both write code and evaluate whether what it did was correct. We also have a short demo with Cursor.

When building apps, we found the hardest part of AI-assisted coding isn’t the coding; it’s the tedious point-and-click testing to see if things work. We got tired of this loop: open the app, click through flows, stare at the network tab, copy console errors to the editor, repeat. It felt obvious this should be AI-assisted too. If you can vibe-code, you should be able to vibe-test!

Some agents like Cline and Windsurf have browser integrations, but Cline’s (via Anthropic Computer Use) felt slow and only reported console logs, and Windsurf’s didn’t work reliably yet. We got so tired of manually testing that we decided to fix it.

Our MCP server sits between your IDE agent (Cursor/Windsurf/Cline/Continue) and a Playwright-powered browser-use agent. It spins up the browser, navigates your app per instructions from the IDE agent, and sends back steps, console events, and network events so the IDE agent can assess the app’s state.
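To make that flow concrete, here is a rough sketch of the relay in plain Python. It is not the actual server code: `EvalResult`, `StubBrowserAgent`, and `evaluate` are hypothetical names, and the stub emits canned events where the real thing would drive a Playwright browser.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalResult:
    """What gets sent back to the IDE agent (field names are assumptions)."""
    steps: List[str] = field(default_factory=list)
    console_events: List[str] = field(default_factory=list)
    network_events: List[str] = field(default_factory=list)


class StubBrowserAgent:
    """Hypothetical stand-in for the Playwright-powered browser agent."""

    def run(self, url: str, task: str,
            on_console: Callable[[str], None],
            on_network: Callable[[str], None]) -> List[str]:
        # A real agent would drive a browser; here we emit canned events.
        on_network(f"GET {url} 200")
        on_console("error: Uncaught TypeError: key is undefined")
        return ["Home", "Login", "API Keys"]


def evaluate(url: str, task: str, agent: StubBrowserAgent) -> EvalResult:
    """Run the browser agent and collect everything it observed."""
    result = EvalResult()
    result.steps = agent.run(
        url, task,
        on_console=result.console_events.append,
        on_network=result.network_events.append,
    )
    return result


result = evaluate("http://localhost:5173", "delete an API key", StubBrowserAgent())
print(result.steps, len(result.console_events), len(result.network_events))
```

The point of the callbacks is that console and network events are captured as they happen, so the IDE agent gets the steps and the side effects together.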

We proxy Browser-use’s original Claude calls and swap in Gemini 2.0 Flash, cutting latency from ~8s to ~3s per step. We also cap console/network logs at 10,000 characters to stay within context limits, and filter out irrelevant logs (e.g., noisy XHR requests).
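The filter-then-cap step could look roughly like this. Treat it as a sketch under assumptions: the noisy markers are made up for illustration, and keeping the head of the log (rather than the tail) on truncation is a choice made here, not necessarily what the server does.

```python
from typing import Iterable, List, Sequence

MAX_LOG_CHARS = 10_000  # the cap mentioned above

# Hypothetical examples of "noise"; the real filter rules are not shown here.
NOISY_MARKERS = ("/analytics", "/hot-update", "favicon.ico")

TRUNCATION_MARKER = "\n[truncated]"


def filter_logs(lines: Iterable[str],
                noisy_markers: Sequence[str] = NOISY_MARKERS) -> List[str]:
    """Drop any log line that contains one of the noisy markers."""
    return [line for line in lines
            if not any(marker in line for marker in noisy_markers)]


def cap_logs(lines: Iterable[str], max_chars: int = MAX_LOG_CHARS) -> str:
    """Join lines and truncate so the result never exceeds max_chars."""
    text = "\n".join(lines)
    if len(text) <= max_chars:
        return text
    keep = max(0, max_chars - len(TRUNCATION_MARKER))
    return (text[:keep] + TRUNCATION_MARKER)[:max_chars]


raw = ["GET /api/keys 200", "GET /analytics/ping 204", "error: boom"]
print(cap_logs(filter_logs(raw)))
```

Filtering before capping matters: otherwise noisy requests eat into the character budget and push real errors out of the window.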

At the end, the browser agent outputs a summary like:

  Web Evaluation Report for http://localhost:5173 
  Task: delete an API key and evaluate UX
  Steps: Home → Login → API Keys → Create Key → Delete Key
  Flow tested successfully; UX had problems X, Y, Z...
  Console (8)...   Network (13)...   Timeline of events (57) …

This lets the coding agent recognize console and network errors, or any issues with clicking around, and fix them before returning to the user. (There’s a longer example in the README at https://github.com/Operative-Sh/web-eval-agent.)
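For a sense of how a summary in that shape might be assembled, here is a small sketch. The `WebEvalReport` class and its field names are hypothetical and only mirror the example report above; the real report has more detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WebEvalReport:
    """Hypothetical container mirroring the example report's fields."""
    url: str
    task: str
    steps: List[str]
    verdict: str
    console: List[str]
    network: List[str]
    timeline: List[str]

    def render(self) -> str:
        """Render the header lines plus per-section event counts."""
        counts = (
            f"Console ({len(self.console)})...   "
            f"Network ({len(self.network)})...   "
            f"Timeline of events ({len(self.timeline)}) ..."
        )
        return "\n".join([
            f"Web Evaluation Report for {self.url}",
            f"Task: {self.task}",
            "Steps: " + " → ".join(self.steps),
            self.verdict,
            counts,
        ])


report = WebEvalReport(
    url="http://localhost:5173",
    task="delete an API key and evaluate UX",
    steps=["Home", "Login", "API Keys", "Create Key", "Delete Key"],
    verdict="Flow tested successfully; UX had problems X, Y, Z...",
    console=["warn"] * 8,
    network=["GET"] * 13,
    timeline=["event"] * 57,
)
print(report.render())
```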

Try it in Cursor / Cline / Windsurf / Claude Desktop (macOS/Linux):

  curl -LSf https://operative.sh/install.sh -o install.sh
  less -N install.sh   # inspect if you’d like
  bash install.sh      # installs uv + jq + Playwright + server
  # then in Cursor/Cline/Windsurf/Continue: craft a prompt using the web_eval_agent tool

(For Windows, there’s a 4-line manual install in the README.)

What we want to do next: pause/go for OAuth screens; save/load browser auth states; Playwright step recording for automated creation of tests and regression tests; supporting Lovable / v0 / Bolt.new sites by offering a web version.

We’d love to hear your feedback, especially if you’ve experienced the pain of manually testing your web app after making changes from inside your IDE, or if you’ve tried any alternative MCP tools for this that have worked well.

Try it out if you feel it’d be helpful for your workflow: https://github.com/Operative-Sh/web-eval-agent. (Note: the server hits our operative.sh proxy to cover Gemini tokens. The MCP server itself is OSS; Anthropic base-URL support is coming soon. Free tier included; heavy users can grab the $10 plan to offset our model bill.)

Let us know what you think! Thanks for reading!

Comments URL: https://news.ycombinator.com/item?id=43822659

Points: 33

# Comments: 7

https://github.com/Operative-Sh/web-eval-agent

Created: 28 Apr 2025, 18:10:15