Playwright Web Scraper

Playwright is a browser automation library that controls Chromium, Firefox, and WebKit through a single API. This module uses Playwright to extract content from web pages, including dynamic content that requires JavaScript execution.

The scraper can:

Load content from single or multiple web pages
Handle JavaScript-rendered content
Support various page load strategies
Wait for specific elements to load
Crawl relative links from websites
Process XML sitemaps

Inputs

URL: The webpage URL to scrape
Text Splitter (optional): A text splitter to process the extracted content
Get Relative Links Method (optional): Choose between:
- Web Crawl: Crawl relative links starting from an HTML page URL
- Scrape XML Sitemap: Collect links from an XML sitemap URL
Get Relative Links Limit (optional): Maximum number of relative links to process (default: 10; set to 0 to process all links)
Wait Until (optional): Page load strategy:
- Load: Wait for the load event to fire
- DOM Content Loaded: Wait for the DOMContentLoaded event
- Network Idle: Wait until there have been no network connections for at least 500 ms
- Commit: Wait until the network response is received and the document starts loading
Wait for selector to load (optional): CSS selector to wait for before scraping
Additional Metadata (optional): JSON object with additional metadata to add to documents
Omit Metadata Keys (optional): Comma-separated list of metadata keys to omit
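
To make the Wait Until and Wait for selector inputs more concrete, here is a minimal sketch using Playwright's Python sync API. It is not this module's implementation: the names `WAIT_UNTIL` and `scrape` are hypothetical, and the default of "Load" is an assumption. The mapping values are the `wait_until` strings that Playwright itself documents.

```python
# Labels are the "Wait Until" options listed above; values are the strings
# Playwright accepts for the wait_until argument of page.goto().
WAIT_UNTIL = {
    "Load": "load",
    "DOM Content Loaded": "domcontentloaded",
    "Network Idle": "networkidle",
    "Commit": "commit",
}


def scrape(url, wait_until="Load", wait_for_selector=None):
    """Return the rendered HTML of `url` (hypothetical helper, not the module's API)."""
    # Imported lazily so the mapping above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Headless and no-sandbox mirror the behaviour described in the Notes section.
        browser = p.chromium.launch(headless=True, args=["--no-sandbox"])
        try:
            page = browser.new_page()
            page.goto(url, wait_until=WAIT_UNTIL[wait_until])
            if wait_for_selector:
                # Blocks until the element appears; Playwright's default timeout applies.
                page.wait_for_selector(wait_for_selector)
            return page.content()
        finally:
            browser.close()
```

Calling `scrape("https://example.com", "Network Idle", "main")` would, under these assumptions, return the page HTML once the network has gone idle and a `main` element is present.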

Outputs

Document: Array of document objects containing metadata and pageContent
Text: Concatenated string from pageContent of documents
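
The relationship between the two outputs and the metadata inputs can be sketched as follows. This is an illustration only: `build_metadata` and `to_text` are hypothetical helpers, and both the newline separator and the order of operations (merge Additional Metadata first, then drop omitted keys) are assumptions rather than documented behaviour.

```python
import json


def build_metadata(base, additional_json="", omit_keys=""):
    """Merge Additional Metadata into `base`, then drop the Omit Metadata Keys."""
    merged = dict(base)
    if additional_json.strip():
        # Additional Metadata is supplied as a JSON object.
        merged.update(json.loads(additional_json))
    # Omit Metadata Keys is a comma-separated list; surrounding spaces are ignored.
    omit = {key.strip() for key in omit_keys.split(",") if key.strip()}
    return {key: value for key, value in merged.items() if key not in omit}


def to_text(documents):
    """Concatenate pageContent across documents (newline separator is an assumption)."""
    return "\n".join(doc["pageContent"] for doc in documents)


documents = [
    {"pageContent": "First page", "metadata": build_metadata({"source": "https://example.com/a"})},
    {"pageContent": "Second page", "metadata": build_metadata({"source": "https://example.com/b"})},
]
text = to_text(documents)
```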

Features

Multi-browser engine support (Chromium, Firefox, WebKit)
JavaScript execution support
Configurable page load strategies
Element wait capabilities
Web crawling functionality
XML sitemap processing
Headless browser operation
Sandbox configuration
Error handling for invalid URLs
Metadata customization

Notes

Runs in headless mode by default
Uses no-sandbox mode for compatibility
Invalid URLs will throw an error
Setting link limit to 0 will retrieve all available links (may take longer)
Supports waiting for specific DOM elements before extraction
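
The URL check and the link-limit rule from the notes above can be illustrated with a short sketch. The helper names `validate_url` and `apply_link_limit` are hypothetical, and the exact validation the module performs is not documented here, so the http/https check is an assumption.

```python
from urllib.parse import urlparse


def validate_url(url):
    """Raise ValueError unless `url` looks like an absolute http(s) URL (assumed rule)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return url


def apply_link_limit(links, limit=10):
    """Keep at most `limit` links; a limit of 0 keeps every link."""
    if limit < 0:
        raise ValueError("limit must be 0 or a positive integer")
    links = list(links)
    return links if limit == 0 else links[:limit]
```

With the default limit of 10, a crawl that finds 25 links would process only the first 10; passing 0 processes all 25, which is why that setting may take longer.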

Scrape One URL

(Optional) Connect a Text Splitter.
Enter the URL to scrape.

Crawl & Scrape Multiple URLs

See the Web Crawl guide to scrape multiple pages.
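
As a rough illustration of the two Get Relative Links methods, the sketch below collects links from an HTML page and from an XML sitemap using only the Python standard library. It is not the module's crawler: the function names are hypothetical, and restricting the crawl to links on the same host as the starting URL is an assumption.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_html(base_url, html):
    """Web Crawl: resolve hrefs against `base_url`, keeping same-host links once each."""
    collector = _HrefCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative paths and drop any #fragment.
        absolute = urljoin(base_url, href).split("#", 1)[0]
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def links_from_sitemap(xml_text):
    """Scrape XML Sitemap: return the text of every <loc> element."""
    root = ET.fromstring(xml_text)
    # Tags carry the sitemap namespace, e.g. "{http://...}loc", so compare the local name.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
```

Either list could then be capped by the Get Relative Links Limit before each page is loaded and scraped.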
