# PDF Files

PDF (Portable Document Format) is a file format developed by Adobe for presenting documents consistently across software platforms. This module provides functionality to load and process PDF files using pdf.js.

This module provides a sophisticated PDF document loader that can:

* Load single or multiple PDF files
* Split documents by page or file
* Support base64 encoded files
* Handle file storage integration
* Process content with text splitters
* Support legacy PDF versions
* Customize metadata extraction

## Inputs

### Required Parameters

* **PDF File**: The PDF file(s) to process (.pdf extension)
* **Usage**: Choose between:
  * One document per page
  * One document per file

### Optional Parameters

* **Text Splitter**: A text splitter to process the extracted content
* **Use Legacy Build**: Whether to use legacy PDF.js build
* **Additional Metadata**: JSON object with additional metadata
* **Omit Metadata Keys**: Comma-separated list of metadata keys to omit

## Outputs

* **Document**: Array of document objects containing metadata and pageContent
* **Text**: Concatenated string from pageContent of documents

## Features

* Multiple file support
* Page-level splitting
* Legacy version support
* Text extraction
* Metadata handling
* Error handling
* Memory-efficient processing

## Processing Modes

### Per Page Mode

* Each page becomes a document
* Preserves page numbers
* Individual page metadata
* Granular content access

### Per File Mode

* Entire PDF as one document
* Combined content
* Single metadata set
* Memory efficient

## Document Structure

Each document contains:

* **pageContent**: Extracted text content
* **metadata**:
  * source: Original file path
  * pdf: PDF-specific metadata
  * page: Page number (in per-page mode)
  * Additional custom metadata

## File Handling

### Local Files

* Direct file loading
* Base64 encoded content
* Multiple file support

### Storage Integration

* File storage system support
* Organization-based storage
* Chatflow-based storage

## Notes

* Uses pdf.js for extraction
* Legacy version support
* Memory-efficient processing
* Error handling for invalid files
* Support for large PDFs
* Flexible output formats
* Metadata customization
* Text encoding handling


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.flowiseai.com/integrations/langchain/document-loaders/pdf-file.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
