🤖 Autonomous Browser AI Agent

An intelligent multi-agent browser automation system powered by LLMs (Gemini, OpenAI, AWS Bedrock). The agent can understand natural language tasks, plan multi-step browser actions, execute them autonomously, and self-correct when things go wrong.

✨ Features

  • Multi-Agent Architecture: Orchestrator → Planner → Executor → Evaluator loop
  • LLM Integration: AWS Bedrock (Claude), Google Gemini, OpenAI support
  • DOM-Aware Planning: Intelligent element detection and selector generation
  • Self-Correction: Automatic re-planning on failures with retry logic
  • Browser Automation: Full Playwright integration (navigate, click, fill, extract, screenshot)
  • Safety Controls: URL scheme filtering, loop detection, max-step limits
  • Human-like Behavior: Configurable delays to reduce bot detection
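The safety controls listed above (URL scheme filtering, loop detection, max-step limits) can be pictured with a minimal sketch. This is not the project's actual implementation: `ALLOWED_SCHEMES`, `StepGuard`, and the default limits are hypothetical names and values chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumptions for illustration only: the real allow-list and limits may differ.
ALLOWED_SCHEMES = {"http", "https"}
MAX_STEPS = 20
MAX_REPEATS = 3


def is_url_allowed(url: str) -> bool:
    """Return True only for URLs whose scheme is on the allow-list."""
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES


class StepGuard:
    """Counts executed actions and flags runaway or repetitive runs."""

    def __init__(self, max_steps: int = MAX_STEPS, max_repeats: int = MAX_REPEATS):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, action: dict) -> None:
        """Record one action; raise RuntimeError when a limit is exceeded."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("max-step limit reached")
        # Fingerprint the action by its type and stringified arguments.
        args = action.get("args", {})
        key = (action.get("type"), tuple(sorted((k, str(v)) for k, v in args.items())))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise RuntimeError("loop detected: same action repeated too often")
```

A caller would run `is_url_allowed(url)` before every `goto` and `guard.check(action)` before every step, treating a `RuntimeError` as a signal to stop or re-plan.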

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Kaangml/autonomous_browser_ai_agent.git
cd autonomous_browser_ai_agent

# Install dependencies with uv
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Copy environment template and add your API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY (or the key for another provider)

Run Your First Task

# Simple CLI usage
uv run python -m src --url "https://example.com" --task "extract the page title"

# With visible browser
uv run python -m src --url "https://example.com" --task "extract content" --no-headless

# JSON output
uv run python -m src --url "https://example.com" --task "get the heading" --json

Run Examples

# Multi-agent example with Gemini
uv run python -m src.examples.example_multiagent

# Wikipedia extraction
uv run python -m src.examples.example_wikipedia

# DuckDuckGo search
uv run python -m src.examples.example_search

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                      ORCHESTRATOR                           │
│         Coordinates the multi-agent workflow                │
│              Plan → Execute → Evaluate                      │
└─────────────────┬───────────────────────────────────────────┘
                  │
    ┌─────────────┼─────────────┐
    ▼             ▼             ▼
┌────────┐   ┌────────┐   ┌────────────┐
│PLANNER │   │EXECUTOR│   │ EVALUATOR  │
│        │   │        │   │            │
│ - DOM  │   │ - Run  │   │ - Check    │
│   aware│   │   steps│   │   success  │
│ - LLM  │   │ - Retry│   │ - Trigger  │
│   plan │   │   logic│   │   replan   │
└────────┘   └────────┘   └────────────┘
                  │
                  ▼
         ┌───────────────┐
         │   BROWSER     │
         │  CONTROLLER   │
         │               │
         │ - Playwright  │
         │ - Safety      │
         │ - Actions     │
         └───────────────┘

See docs/ARCHITECTURE.md for detailed documentation.
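The Plan → Execute → Evaluate loop in the diagram can be sketched roughly as follows. This is a simplified illustration, not the repository's orchestrator: `run_task`, its callback signatures, and `max_replans` are hypothetical, and the three agents are passed in as plain async callables so the sketch stays self-contained.

```python
import asyncio


async def run_task(task, plan, execute, evaluate, max_replans=2):
    """Plan, execute, and evaluate a task; re-plan with feedback on failure.

    plan(task, feedback) -> list of step dicts
    execute(step)        -> result dict with an "ok" flag
    evaluate(task, results) -> bool
    """
    feedback = ""
    results = []
    for attempt in range(max_replans + 1):
        steps = await plan(task, feedback)
        results = []
        for step in steps:
            result = await execute(step)
            results.append(result)
            if not result.get("ok"):
                # Stop this plan early and tell the planner what went wrong.
                feedback = f"step {step.get('type')} failed: {result.get('error')}"
                break
        else:
            # All steps succeeded; let the evaluator decide whether the task is done.
            if await evaluate(task, results):
                return {"ok": True, "attempts": attempt + 1, "results": results}
            feedback = "evaluator rejected the result"
    return {"ok": False, "attempts": max_replans + 1, "results": results}
```

Passing the failure message back into `plan` is one plausible way to drive self-correction; the real system may carry richer context such as DOM snapshots.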

⚙️ Configuration

Environment Variables (.env)

# LLM Provider (choose one)
GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-2.0-flash

# Or OpenAI
# OPENAI_API_KEY=your_key_here
# OPENAI_MODEL=gpt-4-turbo

# Or AWS Bedrock
# AWS_ACCESS_KEY_ID=your_key
# AWS_SECRET_ACCESS_KEY=your_secret
# AWS_REGION=us-east-1
# BEDROCK_MODEL_ID=anthropic.claude-3-sonnet-20240229-v1:0

Browser Settings

from browser.browser_config import BrowserConfigManager

config = BrowserConfigManager.load_from_settings()
# config.config.headless = False      # Show browser
# config.config.timeout = 30          # Timeout in seconds
# config.config.human_delay_min = 0.5 # Min delay between actions
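To show what a human-like delay between actions might look like, here is a minimal sketch. Only `human_delay_min` appears in the settings above; the `human_delay` helper and the upper bound `max_s` are assumptions made for illustration.

```python
import asyncio
import random


async def human_delay(min_s: float = 0.5, max_s: float = 1.5) -> float:
    """Sleep for a random duration in [min_s, max_s] and return it."""
    if min_s < 0 or max_s < min_s:
        raise ValueError("expected 0 <= min_s <= max_s")
    delay = random.uniform(min_s, max_s)
    await asyncio.sleep(delay)
    return delay
```

Calling `await human_delay()` before each click or fill spaces actions irregularly, which is the general idea behind the configurable delays.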

📖 Python API

Using the Multi-Agent System

import asyncio
from dotenv import load_dotenv
load_dotenv()

from llm.factory import get_llm_provider
from agent.planner import PlannerAgent
from agent.executor import ExecutorAgent
from browser.browser import BrowserManager
from browser.browser_config import BrowserConfigManager
from browser.actions import BrowserActions
from controller.browser_controller import BrowserController

async def main():
    # Setup LLM and agents
    llm = get_llm_provider()
    planner = PlannerAgent(llm=llm)
    
    # Setup browser
    config = BrowserConfigManager.load_from_settings()
    browser = BrowserManager(config)
    await browser.start()
    
    actions = BrowserActions(browser)
    controller = BrowserController(actions)
    executor = ExecutorAgent(controller=controller)
    
    try:
        # Plan the task
        steps = await planner.plan("Go to example.com and extract the title")
        print(f"Plan: {len(steps)} steps")
        
        # Execute each step
        page = None
        for step in steps:
            result = await executor.execute(step, page)
            if result.get("page"):
                page = result["page"]
            print(f"{step['type']}: {result.get('ok')}")
            
    finally:
        await browser.close()

asyncio.run(main())

Low-Level Controller Usage

import asyncio
from browser.browser_config import BrowserConfigManager
from browser.browser import BrowserManager
from browser.actions import BrowserActions
from controller.browser_controller import BrowserController

async def main():
    config = BrowserConfigManager.load_from_settings()
    browser = BrowserManager(config)
    await browser.start()
    
    actions = BrowserActions(browser)
    controller = BrowserController(actions)
    
    try:
        # Navigate
        result = await controller.execute_action({
            "type": "goto",
            "args": {"url": "https://example.com"}
        })
        page = result["page"]
        
        # Extract text
        text = await controller.execute_action({
            "type": "extract_text",
            "args": {"page": page, "selector": "h1"}
        })
        print(text["result"])  # "Example Domain"
        
    finally:
        await browser.close()

asyncio.run(main())

📋 Supported Actions

| Action       | Description      | Args                 |
|--------------|------------------|----------------------|
| goto         | Navigate to URL  | url                  |
| click        | Click element    | page, selector       |
| fill         | Type into input  | page, selector, text |
| extract_text | Get element text | page, selector       |
| links        | Get all links    | page, selector?      |
| screenshot   | Capture page     | page, full_page?     |
| scroll       | Scroll page      | page, selector?      |
| wait         | Wait for element | page, selector       |
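The action list above can be turned into a simple pre-flight check for action dictionaries. This sketch only mirrors the listed argument names, reading a trailing `?` as optional; `validate_action` is a hypothetical helper and is not part of the controller's API.

```python
# Required and optional argument names, transcribed from the action list above.
REQUIRED_ARGS = {
    "goto": {"url"},
    "click": {"page", "selector"},
    "fill": {"page", "selector", "text"},
    "extract_text": {"page", "selector"},
    "links": {"page"},
    "screenshot": {"page"},
    "scroll": {"page"},
    "wait": {"page", "selector"},
}
OPTIONAL_ARGS = {
    "links": {"selector"},
    "screenshot": {"full_page"},
    "scroll": {"selector"},
}


def validate_action(action: dict) -> list:
    """Return a list of problems; an empty list means the action looks valid."""
    kind = action.get("type")
    if kind not in REQUIRED_ARGS:
        return [f"unknown action type: {kind!r}"]
    given = set(action.get("args", {}))
    problems = []
    missing = sorted(REQUIRED_ARGS[kind] - given)
    if missing:
        problems.append(f"{kind}: missing args {missing}")
    extra = sorted(given - REQUIRED_ARGS[kind] - OPTIONAL_ARGS.get(kind, set()))
    if extra:
        problems.append(f"{kind}: unexpected args {extra}")
    return problems
```

Running such a check on LLM-generated steps before execution catches malformed plans early, without opening a browser.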

🐳 Docker

# Build the image
docker build -t browser-agent .

# Run with your API key
docker run -e GEMINI_API_KEY=your_key browser-agent \
  --url "https://example.com" --task "extract the title"

See Dockerfile for details.

🧪 Testing

# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run specific test module
uv run pytest tests/agent/ -v

# Run with coverage
uv run pytest --cov=src

📚 Documentation

🗺️ Roadmap

Completed ✅

  • Multi-agent LLM system (Orchestrator, Planner, Executor, Evaluator)
  • LLM provider abstraction (Bedrock, Gemini, OpenAI)
  • DOM-aware intelligent planning
  • Retry logic and error handling
  • Mock provider for testing

Planned 📋

  • Persistent memory (SQLite/vector DB)
  • Job queue and workflow scheduling
  • Web UI for task management
  • Browser extension integration
  • Multi-tab support

📄 License

MIT

🤝 Contributing

Contributions welcome! Please read ROADMAP.md first, then open a PR.
