PDF Files

PDF (Portable Document Format) is a file format developed by Adobe for presenting documents consistently across software platforms. This module provides functionality to load and process PDF files using pdf.js.

This module provides a sophisticated PDF document loader that can:

  • Load single or multiple PDF files

  • Split documents by page or file

  • Support base64 encoded files

  • Handle file storage integration

  • Process content with text splitters

  • Support legacy PDF versions

  • Customize metadata extraction

Inputs

Required Parameters

  • PDF File: The PDF file(s) to process (.pdf extension)

  • Usage: Choose between:

    • One document per page

    • One document per file

Optional Parameters

  • Text Splitter: A text splitter to process the extracted content

  • Use Legacy Build: Whether to use legacy PDF.js build

  • Additional Metadata: JSON object with additional metadata

  • Omit Metadata Keys: Comma-separated list of metadata keys to omit

Outputs

  • Document: Array of document objects containing metadata and pageContent

  • Text: Concatenated string from pageContent of documents

Features

  • Multiple file support

  • Page-level splitting

  • Legacy version support

  • Text extraction

  • Metadata handling

  • Error handling

  • Memory-efficient processing

Processing Modes

Per Page Mode

  • Each page becomes a document

  • Preserves page numbers

  • Individual page metadata

  • Granular content access

Per File Mode

  • Entire PDF as one document

  • Combined content

  • Single metadata set

  • Memory efficient

Document Structure

Each document contains:

  • pageContent: Extracted text content

  • metadata:

    • source: Original file path

    • pdf: PDF-specific metadata

    • page: Page number (in per-page mode)

    • Additional custom metadata

File Handling

Local Files

  • Direct file loading

  • Base64 encoded content

  • Multiple file support

Storage Integration

  • File storage system support

  • Organization-based storage

  • Chatflow-based storage

Notes

  • Uses pdf.js for extraction

  • Legacy version support

  • Memory-efficient processing

  • Error handling for invalid files

  • Support for large PDFs

  • Flexible output formats

  • Metadata customization

  • Text encoding handling

Last updated