Core Team

Kalil Sama Bouzigues

Browser Agents

Master Student

About

Kalil Sama Bouzigues is a computer science and cybersecurity researcher with a strong builder mindset and a clear focus on how intelligent systems interact with the real web. Currently pursuing a master’s degree at EPFL and spending a semester at ETH Zurich through a dual-diploma exchange, he works at the intersection of browser agents, web automation, APIs, and evaluation infrastructure. Alongside his academic training, he has built products such as Stapply, The Browser Arena, and reverse-api-engineer, and is also one of the top contributors to browser-use. His work is driven by a practical question: how can we make web-based AI systems more measurable, trustworthy, and useful in real-world environments?

Research Areas

01 Browser agent evaluation
02 Web automation benchmarks
03 Agent reliability metrics

Project

Browser Arena: Benchmarking AI Agents on the Real Web

Browser Arena is an evaluation platform designed to benchmark browser agents on realistic web tasks and compare them side by side across metrics such as speed, reliability, cost, and task completion quality. The project addresses a growing problem in the agent ecosystem: while browser-based AI systems are advancing quickly, there is still no widely adopted, transparent way to assess which agent performs best on which type of task. Browser Arena creates that missing layer by letting agents compete head-to-head in interactive environments, making performance differences visible in a structured and reproducible way.

From a scientific perspective, the project contributes to the study of embodied AI systems operating in dynamic digital environments. Browser agents do not simply generate text; they perceive interfaces, make sequential decisions, recover from errors, and interact with changing web states. This makes their evaluation substantially more complex than standard language model benchmarking. By building infrastructure for comparative testing on live tasks and combining quantitative metrics with ranking mechanisms such as Elo-style evaluation, the project opens the door to more rigorous experimentation on agent behavior, robustness, and generalization. It also provides a useful testbed for studying failure modes, human preferences, and task-specific tradeoffs across agent architectures.

The commercial relevance is equally strong. As more companies explore browser agents for operations, customer workflows, QA, research, and back-office automation, selecting the right system becomes a costly and high-stakes decision. Browser Arena offers a practical way for enterprises, model teams, and infrastructure providers to evaluate solutions in conditions that closely resemble deployment reality. A benchmarking layer of this kind can become foundational for procurement, model iteration, and trust-building in agentic systems. In that sense, the platform is positioned not only as research infrastructure but also as an enabling layer for a fast-emerging market around reliable web agents.

Interested in collaborating?

We are always looking for talented students, researchers, and industry partners.

Get in Touch