Unstructured File Loader

Use Unstructured.io to load data from a file path.

The Unstructured File Loader uses Unstructured.io to extract and process content from various file formats. It provides advanced document parsing capabilities with configurable options for OCR, chunking, and metadata extraction.

Features

Advanced document parsing
OCR support with multiple language options
Flexible chunking strategies
Table structure inference
Coordinate extraction
Page break handling
XML tag processing
Customizable model selection
Metadata extraction

Configuration

API Setup

Default API URL: https://api.unstructuredapp.io/general/v0/general
Requires API key from Unstructured.io
Can be configured via environment variables:
- UNSTRUCTURED_API_URL
- UNSTRUCTURED_API_KEY
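
As an illustration of the setup above, the sketch below resolves the API URL and key from explicit values first and then from the two environment variables, falling back to the default URL. The helper name `resolveConnection`, the `Env` type, and the precedence order are assumptions made for this example, not the loader's confirmed behaviour.

```typescript
// Hypothetical helper: not part of the loader itself.
const DEFAULT_API_URL = "https://api.unstructuredapp.io/general/v0/general";

interface UnstructuredConnection {
  apiUrl: string;
  apiKey: string;
}

// A plain map of environment variables, e.g. process.env in Node.js.
type Env = Record<string, string | undefined>;

function resolveConnection(
  env: Env,
  explicit: Partial<UnstructuredConnection> = {}
): UnstructuredConnection {
  // Assumed precedence: explicit value, then environment, then default URL.
  const apiUrl =
    explicit.apiUrl || env["UNSTRUCTURED_API_URL"] || DEFAULT_API_URL;
  const apiKey = explicit.apiKey || env["UNSTRUCTURED_API_KEY"];
  if (!apiKey) {
    throw new Error(
      "No API key: pass one explicitly or set UNSTRUCTURED_API_KEY"
    );
  }
  return { apiUrl, apiKey };
}
```

In a Node.js process this could be called as `resolveConnection(process.env)`; failing early on a missing key avoids sending an unauthenticated request.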

Processing Strategies

Strategy: Default is "hi_res"
- Other strategies documented by the Unstructured API include "fast", "ocr_only", and "auto"; choose the one that suits the document type

Chunking Strategy:
- None (default)
- by_title (chunks text based on titles)

Parameters

Required Parameters

File: The document to process
API Key: Unstructured.io API key (if not set via environment)

Optional Parameters

OCR Options

OCR Languages: Array of languages for OCR processing
Encoding: Specify document encoding

Processing Options

Coordinates: Extract element coordinates (true/false)
PDF Table Structure: Infer table structure in PDFs (true/false)
XML Tags: Keep XML tags in output (true/false)
Skip Table Types: Array of document types for which table structure inference is skipped
Hi-Res Model: Specify the high-resolution model name
Include Page Breaks: Include page break information (true/false)

Text Chunking Options

Multi-page Sections: Allow a chunked section to span page boundaries (true/false)
Combine Under N Chars: Merge small elements into a section until it reaches this character count
New After N Chars: Start a new section once the current one reaches this character count (soft limit)
Max Characters: Hard upper limit on the number of characters per element
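
To show how the options above might reach the API, here is a minimal sketch that translates camelCase loader options into the snake_case form fields documented by the Unstructured API. The option names not shown in this page's JSON examples (`encoding`, `xmlKeepTags`, `skipInferTableTypes`, `hiResModelName`, `includePageBreaks`, `newAfterNChars`) and the helper `toFormFields` are assumptions for illustration; the loader's real internals may differ.

```typescript
// Assumed option shape, modelled on the JSON examples in this document.
interface LoaderOptions {
  strategy?: string;
  encoding?: string;
  ocrLanguages?: string[];
  coordinates?: boolean;
  pdfInferTableStructure?: boolean;
  xmlKeepTags?: boolean;
  skipInferTableTypes?: string[];
  hiResModelName?: string;
  includePageBreaks?: boolean;
  chunkingStrategy?: string;
  multiPageSections?: boolean;
  combineUnderNChars?: number;
  newAfterNChars?: number;
  maxCharacters?: number;
}

// snake_case names follow the Unstructured API's request parameters.
const FIELD_NAMES: Record<keyof LoaderOptions, string> = {
  strategy: "strategy",
  encoding: "encoding",
  ocrLanguages: "ocr_languages",
  coordinates: "coordinates",
  pdfInferTableStructure: "pdf_infer_table_structure",
  xmlKeepTags: "xml_keep_tags",
  skipInferTableTypes: "skip_infer_table_types",
  hiResModelName: "hi_res_model_name",
  includePageBreaks: "include_page_breaks",
  chunkingStrategy: "chunking_strategy",
  multiPageSections: "multipage_sections",
  combineUnderNChars: "combine_under_n_chars",
  newAfterNChars: "new_after_n_chars",
  maxCharacters: "max_characters",
};

// Returns [fieldName, value] pairs; unset options are omitted and
// array options produce one pair per item.
function toFormFields(options: LoaderOptions): Array<[string, string]> {
  const fields: Array<[string, string]> = [];
  for (const key of Object.keys(FIELD_NAMES) as Array<keyof LoaderOptions>) {
    const value = options[key];
    if (value === undefined) continue;
    const name = FIELD_NAMES[key];
    if (Array.isArray(value)) {
      for (const item of value) fields.push([name, item]);
    } else {
      fields.push([name, String(value)]);
    }
  }
  return fields;
}
```

Each pair can then be appended to a multipart form alongside the file. Omitting unset options leaves the API's own defaults in effect.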

Output Structure

Document Format

Each processed element becomes a document with:

pageContent: Extracted text content
metadata:
- category: Element type
- Additional metadata from the processing
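
The mapping described above can be sketched as follows. The types `UnstructuredElement` and `LoadedDocument` and the function `elementsToDocuments` are hypothetical names for this example, and skipping elements whose text is not a string is an assumption about what "valid text content" means.

```typescript
// Simplified shape of one element returned by the API (assumed).
interface UnstructuredElement {
  type: string;
  text?: unknown;
  metadata?: Record<string, unknown>;
}

// Simplified shape of a loaded document.
interface LoadedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

function elementsToDocuments(
  elements: UnstructuredElement[]
): LoadedDocument[] {
  const documents: LoadedDocument[] = [];
  for (const element of elements) {
    const { type, text, metadata } = element;
    // Skip elements that carry no usable text.
    if (typeof text !== "string") continue;
    documents.push({
      pageContent: text,
      // Keep the API's metadata and record the element type as "category".
      metadata: { ...metadata, category: type },
    });
  }
  return documents;
}
```

For example, an element of type `Title` with text `Introduction` on page 1 would become a document whose `pageContent` is `Introduction` and whose metadata includes `category: "Title"` next to the page number.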

Element Types

The loader can identify various element types:

Text blocks
Tables
Lists
Headers
Footers
Page breaks (if enabled)
Other structural elements

Usage Examples

Basic Configuration

{
  "apiKey": "your-api-key",
  "strategy": "hi_res",
  "ocrLanguages": ["eng"]
}

Advanced Processing

{
  "apiKey": "your-api-key",
  "strategy": "hi_res",
  "coordinates": true,
  "pdfInferTableStructure": true,
  "chunkingStrategy": "by_title",
  "multiPageSections": true,
  "combineUnderNChars": 100,
  "maxCharacters": 4000
}

Notes

An API call is made for each file processing request
The response contains structured elements, each with text and metadata
Elements without valid text content are filtered out
Files can be processed from in-memory buffers
API error responses are handled by the loader
Each element's type is recorded automatically in its category metadata
Memory-efficient processing

Best Practices

Set appropriate chunking parameters for your use case
Consider OCR language settings for non-English documents
Enable table structure inference for documents with tables
Use coordinates when spatial information is important
Configure character limits based on your downstream processing needs
Monitor API usage and response times
Handle potential API errors in your workflow
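
As one way to handle API errors in a workflow, the sketch below validates a raw API response before it is turned into documents: it rejects non-2xx statuses and payloads that are not an array of elements. The function `parsePartitionResponse` and the `ApiElement` type are hypothetical, and the error messages are illustrative only.

```typescript
// Simplified element shape (assumed for this example).
interface ApiElement {
  type: string;
  text?: string;
  metadata?: Record<string, unknown>;
}

// Takes the HTTP status and raw body text of a response and returns the
// parsed elements, throwing a descriptive error when the call failed.
function parsePartitionResponse(status: number, body: string): ApiElement[] {
  if (status < 200 || status >= 300) {
    throw new Error(
      `Unstructured API request failed with status ${status}: ${body}`
    );
  }
  const parsed: unknown = JSON.parse(body);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected the API to return an array of elements");
  }
  return parsed as ApiElement[];
}
```

Wrapping the call in `try`/`catch` lets a workflow log the status and message, then decide whether to retry or skip the file.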

This section is a work in progress. We appreciate any help you can provide in completing this section. Please check our Contribution Guide to get started.
