In this AI Research Roundup episode, Alex discusses the paper 'MolmoWeb: Open Visual Web Agent and Open Data for the Open Web'. MolmoWeb is a new family of open multimodal web agents designed to navigate the internet using only visual screenshots. Alongside the models, the researchers released MolmoWebMix, a massive dataset containing over 130,000 task demonstrations and GUI perception data. Unlike many existing agents, MolmoWeb operates without needing access to HTML or accessibility trees, making it more robust and easier to apply across arbitrary websites.
Welcome to the AI Research Roundup. I'm Alex. Today we're looking at a paper trending on X, published on April 9th, 2026, just two days ago. It demonstrates that an open 8-billion-parameter model can achieve a 95% success rate on web navigation tasks, even outperforming massive closed systems like GPT-4o. The paper is titled 'MolmoWeb: Open Visual Web Agent and Open Data for the Open Web'. And as we'll see later, the move toward vision-only models could make these agents much more reliable for the general public. Well, Figure 1 illustrates how MolmoWeb processes tasks by looking directly at the screen rather than reading complex website code. The observation space includes the task instruction, the current screenshot, and a history of previous actions. Based on these inputs, the model generates a natural-language thought to explain its reasoning before predicting the next specific action. This example shows the agent identifying the need to enter a destination and then outputting precise click coordinates. All right, Figure 1 showed the interaction loop for a single task, but
Figure 2 details the massive dataset, MolmoWebMix, that makes those interactions possible. The top sections break down graphical user interface perception, including screenshot question answering and grounding, which is the process of linking text descriptions to specific visual coordinates. Meanwhile, the bottom panels show the multi-step task trajectories that teach the model to sequence actions until a goal, like finding a specific recipe, is reached. So, while the previous figure overviewed the dataset components, Figure 4 explains the specialized multi-agent pipeline for generating training data. A planner first decomposes a broad goal into specific sub-goals. These are passed to an operator, which executes browser actions on the page. Finally, a verifier inspects the results to confirm whether the sub-goal was achieved. This iterative process allows the system to correct mistakes, which produces much cleaner training trajectories than a single model could alone. Well, Figure 4 detailed the synthetic generation pipeline, but Table 1 categorizes the foundational human data through a taxonomy of atomic web skills. These
atomic skills represent the essential building blocks, such as searching or filling forms, which the model must master to complete complex browsing sequences. For example, the go-to and search skills handle initial navigation, while specialized actions like apply-filters or add-to-cart enable sophisticated interactions. This structured approach ensures that the training data provides clear, targeted supervision for every fundamental operation an agent might encounter. So, Table 1 categorized the skills, and Table 2 now provides a quantitative breakdown of the full MolmoWebMix dataset. The trajectory data alone includes over 278,000 trajectories, which total more than 2.2 million individual steps across thousands of domains. We can see that trajectories make up 80% of the final training mixture, while the remaining 20% focuses on graphical user interface perception tasks. Crucially, the final column confirms that nearly all of this data is open source, helping the community build more transparent web agents. So, Table 2 quantified the scale of the training data, and Table 4 now
compares the final MolmoWeb agents against existing models across four major browser benchmarks. These benchmarks are standardized tests used to measure how accurately an AI can navigate websites and complete user requests. In the open-weight category, MolmoWeb-8B sets a new record, achieving a 78% success rate on the WebVoyager task. Surprisingly, this outperforms larger proprietary systems like GPT-4o, which only reaches 65% on the same test. All right, while Table 4 established the baseline, Figure 6 explores test-time scaling. Test-time scaling, which refers to using extra computing power during the prediction phase to improve accuracy, provides a significant performance boost: by running four parallel attempts and picking the best outcome, the 8-billion-parameter model reaches nearly 95% accuracy on the WebVoyager benchmark. This approach, shown by the rising lines, effectively fixes early mistakes that might otherwise derail a task. Overall, MolmoWeb proves that
high-quality open data allows smaller models to outperform proprietary giants. By relying entirely on visual screenshots, these agents become more robust and easier to understand. Releasing the full training mixture sets a new standard for transparency in the field. That is it for this episode of the AI Research Roundup. I'm Alex.
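A quick numeric footnote on the test-time scaling result discussed in the episode: if every attempt succeeded independently, best-of-n success would follow 1 − (1 − p)^n. The sketch below applies that formula to the quoted 78% single-attempt rate; the independence assumption is ours for illustration, not a claim from the paper.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** n

# With the 78% single-attempt WebVoyager rate quoted above, four fully
# independent attempts would succeed about 99.8% of the time.
print(round(best_of_n_success(0.78, 4), 4))  # → 0.9977
```

The roughly 95% figure reported for best-of-4 is below this idealized bound, which is consistent with attempts being partly correlated: the same hard tasks tend to fail on every retry.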