โ— PHANTOM
๐Ÿ‡ฎ๐Ÿ‡ณ IN
โœ•
Skip to content

๐Ÿ•ท๏ธ DiscovAI Crawl API(๐Ÿšง Work in Progress ๐Ÿšง): A powerful web scraping solution for AI tools and vector databases. Extract clean HTML, generate LLM-friendly content, and create embeddings from any URL.

DiscovAI/DiscovAI-crawl


DiscovAI Crawl API 🕷️🔍

One API to scrape everything you need from URLs for your AI tool and vector database.

🚧 Work in Progress 🚧

🌟 Features

Our API provides a comprehensive suite of data extraction and processing capabilities:

  • 🧼 Clean HTML (JavaScript and CSS removed)
  • 📝 LLM-friendly Markdown conversion
  • 🚫 Ad-free, cookie banner-free, and dialog-free content
  • 📸 Website screenshots (auto-saved to AWS S3 or Cloudflare R2)
  • 🤖 LLM-generated SEO-friendly content
  • 🔑 LLM-extracted key information (summary, features, FAQs, etc.)
  • 🧠 Ready-to-use embeddings for vector database integration (auto-saved to db)
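
The embeddings feature listed above is aimed at vector database lookups. As a rough illustration of what such a lookup does with the vectors, here is a minimal cosine similarity sketch. The function names `cosineSimilarity` and `rankBySimilarity` are hypothetical helpers for this example and are not part of the DiscovAI codebase; a real vector database would do this ranking for you.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: order stored documents by similarity to a query vector,
// most similar first. The input array is not mutated.
function rankBySimilarity(
  query: number[],
  docs: { id: string; embedding: number[] }[],
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score);
}
```

Calling `rankBySimilarity(queryVector, storedDocs)` returns the stored ids with their scores, highest similarity first.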

🔧 Installation

pnpm i
cd apps/api && pnpm exec playwright install

🚀 Usage

pnpm dev
open http://localhost:3000
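
The README does not document the HTTP routes yet, so the following is only a sketch of how a client might prepare a request against the local dev server. The route path, the `POST` method, and the `{ "url": ... }` body shape are all assumptions made for illustration; check the source under `apps/api` for the real contract. `buildCrawlRequest` is a hypothetical helper, not part of the project.

```typescript
// Request description that can be handed to fetch(endpoint, init).
interface CrawlRequest {
  endpoint: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds a JSON POST request for a crawl route.
// ASSUMPTION: the route path and the { url } body are not documented in the README.
function buildCrawlRequest(baseUrl: string, path: string, targetUrl: string): CrawlRequest {
  // new URL() throws a TypeError if targetUrl is not an absolute URL.
  const target = new URL(targetUrl);
  if (target.protocol !== "http:" && target.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${target.protocol}`);
  }
  return {
    endpoint: new URL(path, baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: target.toString() }),
    },
  };
}
```

With the dev server running, a call could look like `const req = buildCrawlRequest("http://localhost:3000", "/crawl", "https://example.com");` followed by `fetch(req.endpoint, req.init)`, where `/crawl` stands in for whatever route the API actually exposes.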

📦 API Response Structure

{
  "clean_html": "...",
  "LLM_friendly_markdown": "...",
  "clean_text": "...",
  "screenshot_url": "...",
  "llm_extracts_key_info": {
    "what": "...",
    "summary": "...",
    "features": ["...", "..."],
    "faqs": [{"q": "...", "a": "..."}]
  },
  "llm_summarized_detail": "...",
  "embeddings": [...]
}
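
For TypeScript consumers, the response shape above can be written down as types with a small runtime check. This is a sketch transcribed from the sample JSON, not a type exported by the project: the interface names and the `isCrawlResponse` guard are hypothetical, every field is assumed to be required, and `embeddings` is assumed to be a flat array of numbers because the sample only shows `[...]`.

```typescript
interface Faq {
  q: string;
  a: string;
}

interface KeyInfo {
  what: string;
  summary: string;
  features: string[];
  faqs: Faq[];
}

interface CrawlResponse {
  clean_html: string;
  LLM_friendly_markdown: string;
  clean_text: string;
  screenshot_url: string;
  llm_extracts_key_info: KeyInfo;
  llm_summarized_detail: string;
  embeddings: number[]; // ASSUMPTION: one flat numeric vector
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Runtime guard: true only if every field from the sample is present
// with the expected primitive type.
function isCrawlResponse(v: unknown): v is CrawlResponse {
  if (!isRecord(v)) return false;
  const stringKeys = [
    "clean_html",
    "LLM_friendly_markdown",
    "clean_text",
    "screenshot_url",
    "llm_summarized_detail",
  ];
  if (!stringKeys.every((k) => typeof v[k] === "string")) return false;

  const info = v["llm_extracts_key_info"];
  if (!isRecord(info)) return false;
  if (typeof info["what"] !== "string" || typeof info["summary"] !== "string") return false;

  const features = info["features"];
  if (!Array.isArray(features) || !features.every((f) => typeof f === "string")) return false;

  const faqs = info["faqs"];
  if (!Array.isArray(faqs)) return false;
  if (!faqs.every((f) => isRecord(f) && typeof f["q"] === "string" && typeof f["a"] === "string")) {
    return false;
  }

  const emb = v["embeddings"];
  return Array.isArray(emb) && emb.every((x) => typeof x === "number");
}
```

Running the guard on the parsed JSON before use, for example `if (isCrawlResponse(data)) { ... }`, narrows `data` to `CrawlResponse` and catches shape drift while the API is still a work in progress.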

📚 Documentation

TODO

🤝 Contributing

TODO
