Web-CogReasoner introduces a paradigm shift from simply enhancing web agents to systematically building their cognitive abilities from the ground up. Inspired by Bloom's Taxonomy, it decomposes agent capabilities into knowledge content learning (Factual and Conceptual knowledge) and cognitive processes (Procedural knowledge), enabling interpretable and goal-directed behavior. Built upon large multimodal models, it performs knowledge-driven Chain-of-Thought (CoT) reasoning across complex web tasks, where each reasoning step is transparently grounded in a specific knowledge type, ensuring both interpretability and robust generalization.
In summary, our contributions are threefold:
- We propose the Web-CogKnowledge Framework, which decomposes a web agent's capabilities into knowledge content learning and cognitive processes over Factual, Conceptual, and Procedural knowledge, and construct the Web-CogDataset from 14 real-world websites to systematically instill this knowledge.
- We develop Web-CogReasoner, an agent trained with a knowledge-driven Chain-of-Thought (CoT) reasoning framework in which each reasoning step is grounded in a specific knowledge type.
- We introduce Web-CogBench, a comprehensive evaluation suite for assessing and comparing agent performance across the delineated knowledge domains and cognitive capabilities.
Multimodal large-scale models have significantly advanced the development of web agents, enabling them to perceive and interact with the digital environment in a manner analogous to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent’s capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge into Factual, Conceptual, and Procedural domains. In this framework, knowledge content learning corresponds to the agent’s processes of Memorizing and Understanding, which rely on the former two types of knowledge respectively, representing the “what” of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the “how” of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for the web agent. This dataset serves as the agent’s conceptual grounding—the “nouns” upon which comprehension is built—as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, particularly in its capacity for generalization to unseen tasks where its structured knowledge proves decisive. To facilitate rigorous and systematic evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities.
Last Updated: 2025-08-05 13:06 UTC+8
[2025-08-03] Our project website has been launched.
[2025-08-05] We have released the research paper on arXiv.
To facilitate knowledge acquisition for web agents, we constructed the Web-CogDataset, a large-scale, structured curriculum. This dataset is curated from 14 real-world websites and engineered into 12 fine-grained tasks designed to systematically instill the core knowledge necessary for a web agent. It serves as the agent's conceptual grounding—the “nouns” upon which comprehension is built—as well as the basis for learning how to reason and act.
The dataset is hierarchically organized according to our Web-CogKnowledge Framework, which categorizes tasks into three progressive levels of knowledge granularity: Factual, Conceptual, and Procedural.
An illustration of the diverse tasks within the Web-CogDataset.
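For concreteness, a single training sample can be pictured roughly as in the sketch below. This is only an illustration: the field names (`knowledge_type`, `task`, `screenshot_path`, etc.) are assumptions for exposition, not the released schema.

```python
# Illustrative only: a hypothetical record layout for a Web-CogDataset-style
# curriculum sample. Field names are assumptions, not the released schema.
from dataclasses import dataclass, field


@dataclass
class WebCogSample:
    knowledge_type: str        # "Factual" | "Conceptual" | "Procedural"
    task: str                  # e.g. "Element Attribute Rec", "Single Step Exploration"
    website: str               # one of the 14 source websites
    screenshot_path: str       # rendered page observation
    instruction: str           # natural-language prompt given to the agent
    target: str                # gold answer (attribute value, explanation, or action)
    metadata: dict = field(default_factory=dict)


# Example: a Factual-knowledge sample probing element attribute recall.
sample = WebCogSample(
    knowledge_type="Factual",
    task="Element Attribute Rec",
    website="example-shopping-site",
    screenshot_path="screenshots/cart_page.png",
    instruction="What is the text label of the highlighted button?",
    target="Proceed to checkout",
)
print(sample.knowledge_type, "->", sample.task)
```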
To facilitate rigorous and systematic evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite. While our training dataset is structured by knowledge types, Web-CogBench assesses agent performance through the lens of three corresponding cognitive abilities: Memorizing, Understanding, and Exploring. This benchmark is curated from a representative subset of our Web-CogDataset and is designed to measure how well an agent translates its learned knowledge into practical skills within complex web contexts.
Task | Cognition | Metric | #Num |
---|---|---|---|
Element Attribute Rec | Memorizing | ROUGE-L | 249 |
Next Page Pre | Memorizing | Accuracy | 93 |
Source Element Pre | Memorizing | Accuracy | 32 |
Element Understanding | Understanding | LVM Judge | 200 |
WebPage Understanding | Understanding | LVM Judge | 77 |
User's Intention Pre | Exploring | LVM Judge | 105 |
Popup Close | Exploring | Accuracy | 58 |
Single Step Exploration | Exploring | Accuracy | 62 |
Total | - | - | 876 |
The three evaluation dimensions directly mirror our hierarchical knowledge framework: Memorizing draws on Factual knowledge, Understanding on Conceptual knowledge, and Exploring on Procedural knowledge.
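As a rough illustration, the sketch below shows one way the task-level results in the table above could be rolled up into the three cognitive dimensions. The sample-count weighting and the placeholder scores are assumptions, not the benchmark's documented aggregation protocol.

```python
# Hypothetical aggregation of Web-CogBench task scores into the three
# cognitive dimensions. Weighting by #Num and the scores themselves are
# assumptions used purely for illustration.
from collections import defaultdict

# (task, cognition, score, num_samples) -- scores are placeholders.
task_results = [
    ("Element Attribute Rec",   "Memorizing",    0.81, 249),
    ("Next Page Pre",           "Memorizing",    0.78,  93),
    ("Source Element Pre",      "Memorizing",    0.84,  32),
    ("Element Understanding",   "Understanding", 0.86, 200),
    ("WebPage Understanding",   "Understanding", 0.79,  77),
    ("User's Intention Pre",    "Exploring",     0.74, 105),
    ("Popup Close",             "Exploring",     0.88,  58),
    ("Single Step Exploration", "Exploring",     0.70,  62),
]

# cognition -> [weighted score sum, total sample count]
totals = defaultdict(lambda: [0.0, 0])
for _, cognition, score, n in task_results:
    totals[cognition][0] += score * n
    totals[cognition][1] += n

for cognition, (weighted, n) in totals.items():
    print(f"{cognition}: {weighted / n:.3f} over {n} samples")
```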
To instantiate our hierarchical framework in a reasoning-capable agent, we introduce a structured cognitive process: the Knowledge-driven Chain-of-Thought (CoT). Unlike conventional end-to-end models that produce opaque or hallucinated justifications, our CoT is a scaffolded reasoning template where each segment is explicitly grounded in a distinct type of web knowledge. This design ensures interpretability, traceability, and faithful knowledge alignment.
The Web-CogReasoner framework, illustrating the Knowledge-driven CoT process from observation to a concrete action.
The CoT construction follows a layered progression, with each stage corresponding to a cognitive function defined in our Web-CogKnowledge taxonomy: recalling the factual content of the observed page (Memorizing), interpreting what the page and its elements mean for the task (Understanding), and deciding how to act next (Exploring).
By modularizing the CoT along these knowledge axes, Web-CogReasoner achieves structured planning grounded in a cognitively coherent framework. This workflow encapsulates the entire chain from perception to action: Task Prompt → Knowledge-driven CoT Reasoning → Planning → Action.
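To make this workflow concrete, the sketch below renders one hypothetical decision step of the Task Prompt → Knowledge-driven CoT Reasoning → Planning → Action chain. The prompt template, stage labels, and the `query_model` callable are illustrative assumptions standing in for the actual prompts and multimodal backbone.

```python
# A hypothetical sketch of one Web-CogReasoner-style decision step.
# The template, stage labels, and query_model() are assumptions illustrating
# the Task Prompt -> Knowledge-driven CoT -> Planning -> Action flow.
from typing import Callable

COT_TEMPLATE = """Task: {task}
Observation: {observation}

[Factual knowledge / Memorizing] Identify the relevant page elements and their attributes.
[Conceptual knowledge / Understanding] Explain what the current page and elements mean for the task.
[Procedural knowledge / Exploring] Decide which interaction moves the task forward.
Plan: <one-sentence plan>
Action: <a single executable action, e.g. CLICK(element_id) or TYPE(element_id, text)>"""


def cognitive_step(task: str, observation: str,
                   query_model: Callable[[str], str]) -> str:
    """Run one knowledge-driven CoT step and return the chosen action string."""
    prompt = COT_TEMPLATE.format(task=task, observation=observation)
    response = query_model(prompt)
    # Keep only the final action line; earlier lines are the grounded reasoning trace.
    for line in reversed(response.splitlines()):
        if line.strip().startswith("Action:"):
            return line.split("Action:", 1)[1].strip()
    return "NOOP"


# Usage with a stubbed model; a real agent would call a multimodal LLM here.
def fake_model(prompt: str) -> str:
    return ("[knowledge-grounded reasoning trace]\n"
            "Plan: open the search box\n"
            "Action: CLICK(search_input)")


print(cognitive_step("Find wireless headphones", "<page accessibility tree>", fake_model))
```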
This comparison highlights our model's strength in reasoning, a crucial capability that visual-centric models may lack.
Model | Web-CogBench (Cognition) | VisualWebBench (Vision) |
---|---|---|
Proprietary Models | ||
Claude Sonnet 4 | 76.8% | 85.9% |
Gemini 2.5 Pro | 80.2% | 86.6% |
Open-Source Models | ||
Qwen2.5-VL-7B | 69.8% | 76.0% |
UI-TARS-7B-SFT | 46.4% | 86.0% |
Web-CogReasoner (Ours) | 84.4% | 86.3% |
Key Insight: While some models like UI-TARS excel at visual tasks (VisualWebBench: 86.0%), they struggle with reasoning-intensive tasks (Web-CogBench: 46.4%). This highlights that strong visual perception does not guarantee advanced cognitive capabilities, a gap our work aims to fill.
This section evaluates the models' ability to perform complex, multi-step tasks in live web environments.
Model | WebVoyager (Generalization) | Mind2Web (Cross-task) | Mind2Web (Cross-web) |
---|---|---|---|
Proprietary Models | |||
Claude Sonnet 4 | 47.7% | 40.2% | 21.7% |
Gemini 2.5 Pro | 54.9% | 37.5% | 25.5% |
Open-Source Models | |||
Qwen2.5-VL-7B | 2.2% | 1.0% | 1.0% |
OpenWebVoyagerIL | 18.1% | 6.3% | 6.6% |
Web-CogReasoner (Ours) | 30.2% | 17.0% | 10.1% |
@article{guo2025web,
title={Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents},
author={Guo, Yuhan and Guo, Cong and Sun, Aiwen and He, Hongliang and Yang, Xinyu and Lu, Yue and Zhang, Yingji and Guo, Xuntao and Zhang, Dong and Liu, Jianzhuang and others},
journal={arXiv preprint arXiv:2508.01858},
year={2025}
}