Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, Volker Tresp
The paper “WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration” presents a significant advance in the capabilities of autonomous web agents through a dual optimization strategy built around Monte Carlo Tree Search (MCTS). WebPilot addresses the shortcomings of existing large language model (LLM)-based agents by decomposing complex tasks into manageable subtasks via Hierarchical Task Decomposition (HTD) and refining those subtasks through Reflective Task Adjustment (RTA). For local optimization, it employs a customized MCTS with techniques such as Goal-Oriented Selection (GOS) and Reflection-Enhanced Node Expansion (RENE). This human-like adaptability allows WebPilot to navigate dynamic web environments with greater flexibility and efficiency. The system is further enhanced with a Hierarchical Reflection Mechanism and a Granular Bifaceted Self-Reward Mechanism, enabling more precise decision-making and continuous improvement. These innovations yield state-of-the-art performance on benchmarks such as WebArena and MiniWoB++, including a 93% relative increase in success rate over prior methods. For long-horizon tasks, WebPilot's strategy of breaking tasks down and dynamically adjusting plans using comprehensive information sources sustains focus and adaptability. This paper is a must-read for its novel methodologies, substantial performance improvements, and potential to set new standards in autonomous web task execution. Future research could build on this foundation by exploring visual data integration and further scalability enhancements.
Mind Map
graph LR
    root["WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration"]
    root --> research_question["Research Question/Objective"]
    root --> methodology["Methodology"]
    root --> findings["Key Findings/Contributions"]
    root --> theoretical_framework["Theoretical Framework"]
    root --> data_analysis["Data and Analysis"]
    root --> results_discussion["Results and Discussion"]
    root --> implications["Implications"]
    root --> limitations["Limitations"]
    root --> future_research["Future Research Directions"]
    methodology -.-> global_optimization["Global Optimization"]
    methodology -.-> local_optimization["Local Optimization"]
    global_optimization -.-> htd["Hierarchical Task Decomposition (HTD)"]
    global_optimization -.-> rta["Reflective Task Adjustment (RTA)"]
    local_optimization -.-> mcts_strategy["MCTS-Enhanced Decision Strategies"]
    mcts_strategy -.-> gos["Goal-Oriented Selection (GOS)"]
    mcts_strategy -.-> rene["Reflection-Enhanced Node Expansion (RENE)"]
    mcts_strategy -.-> des["Dynamic Evaluation and Simulation (DES)"]
    mcts_strategy -.-> mvb["Maximal Value Backpropagation (MVB)"]
    findings -.-> webpilot_intro["Introduction of WebPilot"]
    findings -.-> hierarchical_reflection["Hierarchical Reflection Mechanism"]
    findings -.-> self_reward_mechanism["Granular Bifaceted Self-Reward Mechanism"]
    findings -.-> sota_performance["State-of-the-Art Performance"]
    theoretical_framework -.-> llms["Large Language Models (LLMs)"]
    theoretical_framework -.-> mcts["Monte Carlo Tree Search (MCTS)"]
    theoretical_framework -.-> pomdp["Partially Observable Markov Decision Process (POMDP)"]
    theoretical_framework -.-> cognitive_flexibility["Cognitive Flexibility"]
    data_analysis -.-> benchmarks["Benchmarks"]
    data_analysis -.-> performance_metrics["Performance Metrics"]
    data_analysis -.-> ablation_studies["Ablation Studies"]
    data_analysis -.-> behavior_analysis["Agent Behavioral Analysis"]
    results_discussion -.-> webarena_benchmark["WebArena Benchmark"]
    results_discussion -.-> miniwob_results["MiniWoB++ Results"]
    results_discussion -.-> limitation_discussion["Limitation Discussion"]
    implications -.-> adaptability["Adaptability"]
    implications -.-> real_world_env["Real-world Environment Applications"]
    limitations -.-> llm_dependency["Reliance on LLMs"]
    limitations -.-> visual_info["Absence of Visual Information"]
    limitations -.-> scalability["Scalability Concerns"]
    limitations -.-> specific_failures["Failures on Specific Web Elements"]
    future_research -.-> visual_integration["Incorporate Visual Data"]
    future_research -.-> scalable_mcts["Scalable MCTS Techniques"]
    future_research -.-> extended_benchmarks["Extended Benchmarks"]
    future_research -.-> user_modeling["Advanced User Modeling"]
    future_research -.-> real_time_learning["Real-time Learning"]
Highlights explained
1. Hierarchical Task Decomposition (HTD)
a. Explanation
Hierarchical Task Decomposition (HTD) is a strategy where complex web tasks are broken down into smaller, manageable subtasks. This decomposition is handled by a Planner component in the WebPilot system.
b. Significance
HTD allows WebPilot to focus on specific, smaller goals sequentially, rather than tackling an entire complex task at once. This leads to more efficient resource utilization and simplified problem-solving steps.
c. Context and Impact
HTD is crucial for dealing with vast action spaces in dynamic web environments, making the agent more adaptable and effective at achieving high-level task goals. This approach aligns with cognitive theories of task management and yields significant performance improvements over agents that lack such decomposition strategies.
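As a minimal sketch of what decomposition yields (the hard-coded subtasks below are hypothetical; in WebPilot a Planner LLM generates them from the task text), HTD turns one long-horizon instruction into an ordered list of subgoals that are then solved one at a time:

from typing import List

def decompose(task: str) -> List[str]:
    # Stub Planner: in WebPilot an LLM produces this list from the task text.
    return [
        "Open a web browser",
        "Log in to GitLab",
        "Find the 'dotfiles' repository",
        "Navigate to its 'Members' page",
        "Invite 'Abishek' as a guest",
    ]

for subtask in decompose("Invite 'Abishek' to 'dotfiles' as a guest"):
    print(subtask)  # each subgoal is handed to the local optimizer in turn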
2. Monte Carlo Tree Search (MCTS)-Enhanced Decision Strategies
a. Explanation
WebPilot employs MCTS algorithms to make decisions for each subtask. Key components include Goal-Oriented Selection (GOS), Reflection-Enhanced Node Expansion (RENE), Dynamic Evaluation and Simulation (DES), and Maximal Value Backpropagation (MVB).
b. Significance
These components ensure precise decision-making by balancing exploration and exploitation, continuously refining strategies based on real-time feedback, and prioritizing actions with the highest potential outcomes.
c. Context and Impact
MCTS is widely recognized for its effectiveness in game playing and robotics. By adapting these principles to web task execution, WebPilot gains a significant edge in dynamic and partially observable environments, achieving superior performance metrics on benchmarks like WebArena.
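For intuition, selection in MCTS is typically governed by the standard UCT score, which the PoC's best_child method below also computes; the node statistics here are made-up numbers:

import math

def uct(value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    # Exploitation (average reward) plus an exploration bonus that decays
    # as a node accumulates visits.
    return value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Two hypothetical children of a node visited 10 times:
print(uct(value=3.0, visits=5, parent_visits=10))  # well-tried, decent average
print(uct(value=1.0, visits=1, parent_visits=10))  # barely tried, big bonus wins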
3. Granular Bifaceted Self-Reward Mechanism
a. Explanation
This mechanism involves a nuanced approach to evaluating actions through two facets: immediate effectiveness and long-term potential. It provides more granular and context-sensitive assessments of each action taken by the agent.
b. Significance
Accurate and detailed feedback allows the agent to make better decisions by understanding both short-term and long-term implications of its actions, thus enhancing overall performance in dynamic web environments.
c. Context and Impact
Incorporating such a refined reward mechanism aligns with reinforcement learning principles and significantly boosts the agent’s ability to adapt to changing conditions, offering a unique advantage over simpler reward models used in other systems.
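One way to picture the two facets is a weighted blend of immediate effectiveness and future promise, each on a 0-10 scale. The 0.5 weighting below is an assumption for illustration, not a value from the paper; the PoC's Appraiser keeps the two facets separate:

def bifaceted_reward(effectiveness: float, future_promise: float, w: float = 0.5) -> float:
    # w trades off the immediate effect of an action against its longer-term potential.
    return w * effectiveness + (1.0 - w) * future_promise

print(bifaceted_reward(8.0, 4.0))  # strong immediate effect, weak outlook -> 6.0
print(bifaceted_reward(2.0, 9.0))  # little progress now, promising state -> 5.5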
4. Reflective Task Adjustment (RTA)
a. Explanation
RTA is a feedback loop mechanism where the agent reassesses and refines its strategy based on new observations after each subtask execution. This continuous feedback and adjustment ensure that the agent remains aligned with the overall task goal.
b. Significance
RTA enhances the adaptability of WebPilot by allowing it to correct course dynamically, responding to new information and unforeseen changes in the environment. This leads to more robust and resilient task execution.
c. Context and Impact
Reflective mechanisms are inspired by cognitive flexibility theories and are relatively novel in web task automation. They provide a significant boost in the agent’s adaptability, positioning WebPilot ahead of traditional fixed-policy agents.
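Schematically, RTA is an observe-reflect-adjust loop around each subtask. The sketch below uses stub helpers (execute, reflect) standing in for real browser actions and LLM calls; the PoC approximates the same loop with its Controller's completeness/reflection output:

from dataclasses import dataclass

@dataclass
class Reflection:
    requires_replan: bool
    note: str

def execute(subtask: str) -> str:
    return f"executed: {subtask}"  # stub for real web actions

def reflect(subtask: str, outcome: str) -> Reflection:
    # Stub: a real agent would query an LLM with the fresh observation.
    return Reflection(requires_replan="error" in outcome, note=outcome)

plan = ["Open the 'Members' page", "Invite 'Abishek' as a guest"]
while plan:
    subtask = plan.pop(0)
    outcome = execute(subtask)
    print(outcome)
    if reflect(subtask, outcome).requires_replan:
        plan.insert(0, f"Recover and retry: {subtask}")  # refine the remaining plan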
5. Effective Strategies for Long-Horizon Tasks and Diverse Information Sources
Explanation
For agents like WebPilot to handle long-horizon tasks and integrate diverse information sources effectively, several strategies can be implemented.
a. Integration of Visual Data
Incorporating visual data alongside textual information can provide a more comprehensive understanding of the web environment, improving decision-making accuracy and adaptability.
b. Continual Learning Mechanisms
Implementing continual learning allows the agent to adapt and improve based on ongoing interactions and feedback, enhancing performance over time even in evolving environments.
c. Advanced User Modeling
Incorporating user behavior modeling can personalize interactions and improve the agent’s ability to predict and meet user needs more effectively.
d. Scalable MCTS Techniques
Exploring more efficient MCTS methods or hybrid approaches can help scale the decision-making process to handle larger and more complex tasks without compromising performance.
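As one concrete illustration of this last point (progressive widening is a standard MCTS trick, not something the paper itself implements), capping how many children a node may have as a function of its visit count keeps the tree tractable on large action spaces:

def max_children(visits: int, k: float = 2.0, alpha: float = 0.5) -> int:
    # Allow roughly k * visits**alpha children, so expansion slows as a node
    # is revisited instead of enumerating the full action space at once.
    return max(1, int(k * visits ** alpha))

for v in (1, 4, 16, 64):
    print(v, max_children(v))  # 1->2, 4->4, 16->8, 64->16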
Code
The PoC implementation includes a simulated GitLab web environment for the demo task "Navigate to the 'Members' page of the 'dotfiles' repository on GitHub and invite 'Abishek' as a guest." (The task string says "GitHub"; the code normalizes it to "GitLab" to match the simulated environment.)
It is only a very high-level implementation of the paper's reflection-based rewarding, intended for educational purposes. It assumes an OPENAI_API_KEY environment variable is set.
pip install openai dspy
import os
import random
from typing import List, Dict, Any
import openai
import dspy
import math
# Set up OpenAI API
openai.api_key = os.getenv("OPENAI_API_KEY")
# Set up DSPy (dspy.OpenAI is the DSPy 2.x client; newer DSPy releases expose dspy.LM instead)
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-4o-mini"))
class WebElement:
    def __init__(self, element_type: str, text: str, attributes: Dict[str, str] = None):
        self.element_type = element_type
        self.text = text
        self.attributes = attributes or {}
class WebPage:
    def __init__(self, url: str, elements: List[WebElement]):
        self.url = url
        self.elements = elements
class WebEnvironment:
    def __init__(self):
        self.pages = {
            "dashboard": WebPage("https://gitlab.com/dashboard", [
                WebElement("link", "Projects"),
                WebElement("link", "Groups"),
                WebElement("link", "dotfiles"),
            ]),
            "dotfiles": WebPage("https://gitlab.com/byteblazeuser/dotfiles", [
                WebElement("link", "Project Information"),
                WebElement("link", "Repository"),
                WebElement("link", "Issues"),
                WebElement("link", "Members"),
            ]),
            "members": WebPage("https://gitlab.com/byteblazeuser/dotfiles/-/project_members", [
                WebElement("button", "Invite members"),
                WebElement("list", "Current members"),
            ]),
            "invite": WebPage("https://gitlab.com/byteblazeuser/dotfiles/-/project_members/new", [
                WebElement("textbox", "Username or email address"),
                WebElement("dropdown", "Choose a role permission"),
                WebElement("button", "Invite"),
            ]),
        }
        self.current_page = None
        self.browser_opened = False
        self.logged_in = False

    def get_observation(self) -> str:
        if not self.browser_opened:
            return "Web browser is not opened."
        if not self.logged_in:
            return "Web browser is opened but not logged in to GitLab."
        if self.current_page is None:
            return "Logged in to GitLab, but no specific page is open."
        page = self.pages[self.current_page]
        return f"Current URL: {page.url}\nElements: " + ", ".join([f"{e.element_type}: {e.text}" for e in page.elements])

    def take_action(self, action: str) -> str:
        action = action.lower()
        if "open" in action and "browser" in action:
            if not self.browser_opened:
                self.browser_opened = True
                return "Web browser opened successfully."
            else:
                return "Web browser is already open."
        if not self.browser_opened:
            return "Cannot perform action. Web browser is not opened."
        if "log in" in action and "gitlab" in action:
            if not self.logged_in:
                self.logged_in = True
                self.current_page = "dashboard"
                return "Logged in to GitLab. Now on dashboard."
            else:
                return "Already logged in to GitLab."
        if not self.logged_in:
            return "Cannot perform action. Not logged in to GitLab."
        if action.startswith("click "):
            # `action` was lowercased above, so element names must be
            # compared in lowercase as well.
            element = action[6:]
            if element == "dotfiles" and self.current_page == "dashboard":
                self.current_page = "dotfiles"
            elif element == "members" and self.current_page == "dotfiles":
                self.current_page = "members"
            elif element == "invite members" and self.current_page == "members":
                self.current_page = "invite"
            elif element == "invite" and self.current_page == "invite":
                return "Invitation sent successfully"
        elif action.startswith("type "):
            if self.current_page == "invite":
                return f"Typed '{action[5:]}' into the textbox"
        return self.get_observation()
class Planner(dspy.Module):
    def __init__(self):
        super().__init__()
        self.plan = dspy.ChainOfThought("task -> detailed_plan")

    def forward(self, task: str) -> List[str]:
        result = self.plan(task=task)
        plan = [step.strip() for step in result.detailed_plan.split('\n') if step.strip()]
        if len(plan) < 3 or not plan[-1].endswith('.'):
            plan.append("Complete any remaining steps to fulfill the task.")
        return plan
class Controller(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought("subtask, actions, observation -> completeness, reflection")

    def forward(self, subtask: str, actions: List[str], observation: str) -> Dict[str, Any]:
        result = self.assess(
            subtask=subtask,
            actions=", ".join(actions),
            observation=observation
        )
        completeness = result.completeness.lower()
        if "complete" in completeness and self.subtask_goal_achieved(subtask, observation):
            completeness = "complete"
        elif "partial" in completeness or len(actions) > 0:
            completeness = "partial"
        else:
            completeness = "incomplete"
        return {
            "completeness": completeness,
            "reflection": result.reflection
        }

    def subtask_goal_achieved(self, subtask: str, observation: str) -> bool:
        subtask_lower = subtask.lower()
        if "open a web browser" in subtask_lower:
            return "Web browser opened" in observation
        elif "log in" in subtask_lower:
            return "Logged in to GitLab" in observation
        elif "find the 'dotfiles' repository" in subtask_lower:
            return "Current URL: https://gitlab.com/byteblazeuser/dotfiles" in observation
        elif "members page" in subtask_lower:
            return "Current URL: https://gitlab.com/byteblazeuser/dotfiles/-/project_members" in observation
        elif "invite" in subtask_lower:
            return "Invitation sent successfully" in observation
        return False
class Explorer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_action = dspy.ChainOfThought("observation, subtask, history, reflections -> action, intent")
        self.analyze_effect = dspy.ChainOfThought("previous_observation, current_observation, intent -> effect")
        self.generate_reflection = dspy.ChainOfThought("observation, subtask, action, effect -> child_reflection, sibling_reflection")

    def forward(self, observation: str, subtask: str, history: List[str], reflections: Dict[str, str]) -> Dict[str, str]:
        result = self.generate_action(
            observation=observation,
            subtask=subtask,
            history=", ".join(history),
            reflections=str(reflections)
        )
        action = result.action
        # Check if the action has been repeated and adjust if necessary
        if action in history:
            if "open" in action.lower() and "browser" in action.lower():
                action = "Go to the GitLab website"
            elif "log in" in action.lower():
                action = "Navigate to the GitLab dashboard"
            else:
                action = f"Try alternative action for: {action}"
        return {"action": action, "intent": result.intent}

    def analyze(self, previous_observation: str, current_observation: str, intent: str) -> str:
        result = self.analyze_effect(
            previous_observation=previous_observation,
            current_observation=current_observation,
            intent=intent
        )
        return result.effect

    def reflect(self, observation: str, subtask: str, action: str, effect: str) -> Dict[str, str]:
        result = self.generate_reflection(
            observation=observation,
            subtask=subtask,
            action=action,
            effect=effect
        )
        return {
            "child_reflection": result.child_reflection,
            "sibling_reflection": result.sibling_reflection
        }
class Appraiser(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought("effect, observation, subtask -> effectiveness, future_promise, reasoning")

    def forward(self, effect: str, observation: str, subtask: str) -> Dict[str, float]:
        result = self.assess(effect=effect, observation=observation, subtask=subtask)
        # Ensure effectiveness and future_promise are numeric
        try:
            effectiveness = float(result.effectiveness)
        except ValueError:
            effectiveness = self.interpret_score(result.effectiveness)
        try:
            future_promise = float(result.future_promise)
        except ValueError:
            future_promise = self.interpret_score(result.future_promise)
        return {
            "effectiveness": effectiveness,
            "future_promise": future_promise,
            "reasoning": result.reasoning
        }

    def interpret_score(self, assessment: str) -> float:
        assessment = assessment.lower()
        if "no" in assessment or "fail" in assessment:
            return 0.0
        elif "low" in assessment or "minor" in assessment:
            return 3.0
        elif "moderate" in assessment or "partial" in assessment:
            return 5.0
        elif "high" in assessment or "significant" in assessment:
            return 8.0
        elif "complete" in assessment or "perfect" in assessment:
            return 10.0
        else:
            return 5.0  # Default to moderate if unclear
class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0

    def add_child(self, child_state):
        child = MCTSNode(child_state, self)
        self.children.append(child)
        return child

    def update(self, reward):
        self.visits += 1
        self.value += reward

    def fully_expanded(self):
        # Demo simplification: a node with any child counts as fully expanded.
        return len(self.children) > 0

    def best_child(self, c_param=1.4):
        # UCT: exploitation (value/visits) plus an exploration bonus.
        choices_weights = [
            (c.value / c.visits) + c_param * ((math.log(self.visits) / c.visits) ** 0.5)
            for c in self.children
        ]
        return self.children[choices_weights.index(max(choices_weights))]
class MCTS:
    def __init__(self, explorer, appraiser, environment):
        self.explorer = explorer
        self.appraiser = appraiser
        self.environment = environment
        self.root = None

    def search(self, initial_state, subtask, n_iterations=100):
        self.root = MCTSNode(initial_state)
        for _ in range(n_iterations):
            node = self.select(self.root)
            child = self.expand(node, subtask)
            reward = self.simulate(child, subtask)
            self.backpropagate(child, reward)
        return self.best_action(self.root)

    def select(self, node):
        while node.fully_expanded():
            node = node.best_child()
        return node

    def expand(self, node, subtask):
        # Demo simplification: expansion and simulation act on the live
        # environment rather than on a copied state.
        action_info = self.explorer(node.state, subtask, [], {})
        new_state = self.environment.take_action(action_info["action"])
        return node.add_child(new_state)

    def simulate(self, node, subtask):
        current_state = node.state
        depth = 0
        while depth < 5:  # Limit simulation depth
            action_info = self.explorer(current_state, subtask, [], {})
            new_state = self.environment.take_action(action_info["action"])
            effect = self.explorer.analyze(current_state, new_state, action_info["intent"])
            appraisal = self.appraiser(effect, new_state, subtask)
            if appraisal["effectiveness"] >= 8:  # Threshold for successful simulation
                return 1
            current_state = new_state
            depth += 1
        return 0

    def backpropagate(self, node, reward):
        while node is not None:
            node.update(reward)
            node = node.parent

    def best_action(self, node):
        return max(node.children, key=lambda c: c.visits).state
class WebPilot:
    def __init__(self):
        self.planner = Planner()
        self.controller = Controller()
        self.explorer = Explorer()
        self.appraiser = Appraiser()
        self.environment = WebEnvironment()
        self.mcts = MCTS(self.explorer, self.appraiser, self.environment)
        self.action_history = []
        self.max_repeated_actions = 3
        self.subtask_attempt_limit = 7

    def execute_task(self, task: str):
        task = task.replace("GitHub", "GitLab")
        subtasks = self.planner(task)
        print(f"Generated plan: {subtasks}")
        for subtask in subtasks:
            print(f"\nExecuting subtask: {subtask}")
            self.action_history.clear()
            observation = self.environment.get_observation()
            reflections = {}
            # Initialize so the post-loop check is safe even if the loop
            # breaks before the Controller runs.
            completion = {"completeness": "incomplete", "reflection": ""}
            for attempt in range(self.subtask_attempt_limit):
                mcts_result = self.mcts.search(observation, subtask)
                action_info = self.explorer(mcts_result, subtask, self.action_history, reflections)
                action = action_info["action"]
                if action.lower() == "no action needed":
                    print("No action needed. Moving to next subtask.")
                    break
                print(f"Action: {action}")
                self.action_history.append(action)
                new_observation = self.environment.take_action(action)
                print(f"Observation: {new_observation}")
                effect = self.explorer.analyze(observation, new_observation, action_info["intent"])
                new_reflections = self.explorer.reflect(new_observation, subtask, action, effect)
                reflections.update(new_reflections)
                appraisal = self.appraiser(effect, new_observation, subtask)
                print(f"Effectiveness: {appraisal['effectiveness']}, Future Promise: {appraisal['future_promise']}")
                completion = self.controller(subtask, self.action_history, new_observation)
                if completion["completeness"] == "complete":
                    print(f"Subtask completed: {subtask}")
                    print(f"Reflection: {completion['reflection']}")
                    break
                if "already" in new_observation.lower() or appraisal['effectiveness'] >= 8:
                    print(f"Subtask seems to be completed: {subtask}")
                    break
                observation = new_observation
            if completion["completeness"] != "complete" and attempt == self.subtask_attempt_limit - 1:
                print(f"Failed to complete subtask: {subtask}")
                print(f"Reflection: {completion['reflection']}")
        print("Task execution completed.")
# Example usage
webpilot = WebPilot()
task = "Navigate to the 'Members' page of the 'dotfiles' repository on GitHub and invite 'Abishek' as a guest."
webpilot.execute_task(task)