Web-CogReasoner

Towards Knowledge-Induced Cognitive Reasoning for Web Agents

1Southwestern University of Finance and Economics, 2Shanghai Jiao Tong University, 3Central South University, 4Hithink Research, 5Westlake University, 6Harbin Institute of Technology, 7University of Manchester, 8University of California, Los Angeles, 9University of Adelaide, 10Fudan University, 11Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Overview

Web-CogReasoner introduces a paradigm shift from simply enhancing web agents to systematically building their cognitive abilities from the ground up. Inspired by Bloom's Taxonomy, we decompose agent capabilities into knowledge content learning (Factual, Conceptual) and cognitive processes (Procedural), enabling interpretable and goal-directed behavior. Built upon large multimodal models, it performs knowledge-driven Chain-of-Thought (CoT) reasoning across complex web tasks, where each reasoning step is transparently grounded in a specific knowledge type, ensuring both interpretability and robust generalization.

In summary, our contributions are threefold:

  • Drawing inspiration from Bloom’s taxonomy and established human educational paradigms, we propose the Web-CogKnowledge Framework — a systematic, two-stage training methodology designed to enhance the cognitive capabilities of web agents. Built upon this framework, we develop Web-CogReasoner. Rigorous benchmarking demonstrates that agents trained under our framework achieve a significant performance improvement over current state-of-the-art models.
  • We construct the Web-CogDataset, a structured curriculum consisting of 12 fine-grained and progressively challenging tasks. These tasks are meticulously designed to incrementally build the agent’s web knowledge, cognition capability, and higher-order reasoning.
  • To enable comprehensive and robust evaluation, we introduce Web-CogBench, a novel benchmark specifically designed to assess whether a web agent possesses the requisite prior knowledge and cognitive capabilities for effective web navigation. This benchmark will be released publicly to foster further research in this area.

Abstract

Large-scale multimodal models have significantly advanced the development of web agents, enabling them to perceive and interact with the digital environment in a manner analogous to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent’s capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge into Factual, Conceptual, and Procedural domains. In this framework, knowledge content learning corresponds to the agent’s processes of Memorizing and Understanding, which rely on the former two types of knowledge respectively, representing the “what” of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the “how” of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for the web agent. This dataset serves as the agent’s conceptual grounding—the “nouns” upon which comprehension is built—as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, particularly in its capacity for generalization to unseen tasks, where its structured knowledge proves decisive. To facilitate rigorous and systematic evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities.

To-Do List

Last Updated: 2025-08-05 13:06 UTC+8

  • Paper: Release the full research paper on arXiv.
  • Code: Open-source the complete code for training and inference.
  • Model: Publish the official Web-CogReasoner model weights.
  • Dataset: Make the Web-CogDataset publicly available for community research.
  • Benchmark: Launch a public online evaluation server for Web-CogBench.

News

[2025-08-03] Our project website has been launched.

[2025-08-05] We have released the research paper on arXiv.

Details

Web-CogDataset

To facilitate knowledge acquisition for web agents, we constructed the Web-CogDataset, a large-scale, structured curriculum. This dataset is curated from 14 real-world websites and engineered into 12 fine-grained tasks designed to systematically instill the core knowledge necessary for a web agent. It serves as the agent's conceptual grounding—the “nouns” upon which comprehension is built—as well as the basis for learning how to reason and act.

The dataset is hierarchically organized according to our Web-CogKnowledge Framework, which categorizes tasks into three progressive levels of knowledge granularity:


An illustration of the diverse tasks within the Web-CogDataset.

  • Factual Web Knowledge (81K samples): These tasks focus on the "what" of web interaction, training the agent to identify attributes of web elements and predict the immediate consequences of a single action. Tasks include Element Attribute Recognition, Sub-elements Prediction, Page Change Prediction, Next Page Prediction, and Source Element Prediction.
  • Conceptual Web Knowledge (62K samples): Moving from perception to comprehension, these tasks train the agent to synthesize facts into meaningful patterns and understand the "why" behind the interface. Tasks include Element Understanding, WebPage Understanding, and Caption & QA.
  • Procedural Web Knowledge (27K samples): This final layer transitions the agent from understanding to action. It teaches the agent the "how" of task completion by training it to formulate and execute goal-oriented plans. Tasks include User's Intention Prediction, Popup Close, Single-Step Web Task, and Noisy Multi-Steps Web Task.
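For concreteness, the three-level organization above can be pictured as a set of serialized training samples. The records below are purely illustrative: the field names (`knowledge_type`, `task`, `input`, `target`) and the example values are assumptions for the sketch, not the released Web-CogDataset schema.

```python
import json

# Hypothetical samples, one per knowledge level; keys and values are
# illustrative assumptions, not the dataset's actual format.
samples = [
    {"knowledge_type": "factual",                 # "what is on the page"
     "task": "element_attribute_recognition",
     "input": "What is the aria-label of the highlighted button?",
     "target": "Search"},
    {"knowledge_type": "conceptual",              # "what does this element mean"
     "task": "element_understanding",
     "input": "What is the purpose of the magnifying-glass icon?",
     "target": "It submits the search query typed in the adjacent box."},
    {"knowledge_type": "procedural",              # "how to accomplish the task"
     "task": "single_step_web_task",
     "input": "Goal: open the pricing page. Current page: site homepage.",
     "target": "CLICK(link='Pricing')"},
]

print(json.dumps(samples[0], indent=2))
```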

Web-CogBench

To facilitate rigorous and systematic evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite. While our training dataset is structured by knowledge types, Web-CogBench assesses agent performance through the lens of three corresponding cognitive abilities: Memorizing, Understanding, and Exploring. This benchmark is curated from a representative subset of our Web-CogDataset and is designed to measure how well an agent translates its learned knowledge into practical skills within complex web contexts.

Task                           Cognition      Metric     #Num
Element Attribute Recognition  Memorizing     ROUGE-L     249
Next Page Prediction           Memorizing     Accuracy     93
Source Element Prediction      Memorizing     Accuracy     32
Element Understanding          Understanding  LVM Judge   200
WebPage Understanding          Understanding  LVM Judge    77
User's Intention Prediction    Exploring      LVM Judge   105
Popup Close                    Exploring      Accuracy     58
Single Step Exploration        Exploring      Accuracy     62
Total                          -              -           876
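The open-ended Memorizing tasks above are scored with ROUGE-L, which measures the longest common subsequence (LCS) between a predicted string and the reference answer. As a reference point, here is a minimal token-level implementation using the balanced F1 variant; the benchmark's exact tokenization and recall weighting are not specified here, so treat this as an illustrative sketch rather than the official scorer.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (balanced precision/recall)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l("click the submit button", "the submit button")` yields precision 3/4 and recall 3/3, giving an F1 of about 0.857.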

The three evaluation dimensions directly mirror our hierarchical knowledge framework:

  • Memorizing: Assesses the agent's ability to recall and recognize concrete information, directly corresponding to the acquisition of Factual Knowledge. It evaluates whether the agent can accurately identify the attributes of web elements and the state of a webpage.
  • Understanding: Measures the agent's capacity for semantic interpretation, aligning with the mastery of Conceptual Knowledge. It tests whether the agent can move beyond mere identification to comprehend the function of elements and the contextual relationships within a page.
  • Exploring: Evaluates the agent's ability to plan and execute goal-oriented actions, reflecting the application of Procedural Knowledge. It assesses whether the agent can strategically navigate the web, handle interruptions, and complete multi-step tasks to fulfill user goals.

Knowledge-driven Chain-of-Thought (CoT)

To instantiate our hierarchical framework in a reasoning-capable agent, we introduce a structured cognitive process: the Knowledge-driven Chain-of-Thought (CoT). Unlike conventional end-to-end models that produce opaque or hallucinated justifications, our CoT is a scaffolded reasoning template where each segment is explicitly grounded in a distinct type of web knowledge. This design ensures interpretability, traceability, and faithful knowledge alignment.


The Web-CogReasoner framework, illustrating the Knowledge-driven CoT process from observation to a concrete action.

The CoT construction follows a layered progression, with each stage corresponding to a cognitive function defined in our Web-CogKnowledge taxonomy:

  • Factual Knowledge (blue): This layer constitutes the foundation of reasoning by anchoring the model in observable page facts. The agent answers the question: "What is on the page?" by extracting concrete interface features like DOM structure and ARIA attributes.
  • Conceptual Knowledge (green): This layer builds upon the factual base by introducing semantic understanding. The agent answers: "What does this element mean?" and "How does it contribute to the task?" by interpreting the purpose of UI components and predicting their behavior.
  • Procedural Knowledge (yellow): This layer informs goal-oriented planning and task decomposition. The agent answers: "How should the task be accomplished?" by inferring user intent, predicting the next sub-goal, and generating a sequential plan.

By modularizing the CoT along these knowledge axes, Web-CogReasoner achieves structured planning grounded in a cognitively coherent framework. This workflow encapsulates the entire chain from perception to action: Task Prompt → Knowledge-driven CoT Reasoning → Planning → Action.
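The layered progression above can be sketched as a prompt scaffold in which each reasoning segment is tagged with the knowledge type that grounds it. The section headers mirror the paper's three knowledge layers, but the exact wording, placeholders, and function name are assumptions for illustration, not the released Web-CogReasoner prompt.

```python
# Hypothetical scaffold for the Knowledge-driven CoT; the template text is
# an illustrative assumption, not the model's actual training prompt.
COT_TEMPLATE = """Task: {task}

[Factual Knowledge] What is on the page?
{facts}

[Conceptual Knowledge] What do the key elements mean for this task?
{concepts}

[Procedural Knowledge] How should the task be accomplished?
{plan}

Action:"""

def build_cot_prompt(task: str, facts: str, concepts: str, plan: str) -> str:
    """Assemble one reasoning step, with each segment grounded in a knowledge type."""
    return COT_TEMPLATE.format(task=task, facts=facts, concepts=concepts, plan=plan)
```

In this sketch, the model's completion after "Action:" would be the concrete step (e.g. a click or type command), so each emitted action is traceable back to the factual, conceptual, and procedural segments that justified it.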

Performance

Cognitive & Visual Benchmarks

This comparison highlights our model's strength in reasoning, a crucial capability that visual-centric models may lack.

Model                     Web-CogBench (Cognition)   VisualWebBench (Vision)
Proprietary Models
  Claude Sonnet 4         76.8%                      85.9%
  Gemini 2.5 Pro          80.2%                      86.6%
Open-Source Models
  Qwen2.5-VL-7B           69.8%                      76.0%
  UI-TARS-7B-SFT          46.4%                      86.0%
  Web-CogReasoner (Ours)  84.4%                      86.3%

Key Insight: While some models like UI-TARS excel at visual tasks (VisualWebBench: 86.0%), they struggle with reasoning-intensive tasks (Web-CogBench: 46.4%). This highlights that strong visual perception does not guarantee advanced cognitive capabilities—a gap our work aims to fill.

Online Web Task

This section evaluates the models' ability to perform complex, multi-step tasks in live web environments.

Model                     WebVoyager (Generalization)   Mind2Web (Cross-task)   Mind2Web (Cross-web)
Proprietary Models
  Claude Sonnet 4         47.7%                         40.2%                   21.7%
  Gemini 2.5 Pro          54.9%                         37.5%                   25.5%
Open-Source Models
  Qwen2.5-VL-7B            2.2%                          1.0%                    1.0%
  OpenWebVoyagerIL        18.1%                          6.3%                    6.6%
  Web-CogReasoner (Ours)  30.2%                         17.0%                   10.1%

BibTeX


    @article{guo2025web,
      title={Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents},
      author={Guo, Yuhan and Guo, Cong and Sun, Aiwen and He, Hongliang and Yang, Xinyu and Lu, Yue and Zhang, Yingji and Guo, Xuntao and Zhang, Dong and Liu, Jianzhuang and others},
      journal={arXiv preprint arXiv:2508.01858},
      year={2025}
    }