Unstructured Folder Loader
Use Unstructured.io to load data from a folder. Note: .png and .heic files are not supported until unstructured is updated.

The Unstructured Folder Loader uses Unstructured.io to load and process multiple documents from a folder. It provides advanced document parsing capabilities with extensive configuration options for OCR, chunking, and metadata extraction.
Features
Batch processing of multiple documents
Multiple processing strategies
OCR support with 15+ languages
Flexible chunking strategies
Table structure inference
XML processing options
Page break handling
Coordinate extraction
Metadata customization
Configuration
API Setup
Default API URL:
http://localhost:8000/general/v0/general
Can be configured via environment variable:
UNSTRUCTURED_API_URL
Optional API key authentication
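The endpoint resolution described above can be sketched in a few lines of Python. This is only an illustration, not the loader's own code: the helper name `resolve_api_url` and the precedence order (explicit value, then `UNSTRUCTURED_API_URL`, then the default) are assumptions.

```python
import os

# Default endpoint quoted in this page.
DEFAULT_API_URL = "http://localhost:8000/general/v0/general"


def resolve_api_url(explicit_url=None, env=None):
    """Pick the API URL (hypothetical helper).

    Assumed precedence: explicit value, then the UNSTRUCTURED_API_URL
    environment variable, then the documented default.
    """
    env = os.environ if env is None else env
    return explicit_url or env.get("UNSTRUCTURED_API_URL") or DEFAULT_API_URL
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.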
Parameters
Required Parameters
Folder Path: Path to the folder containing documents to process
Optional Parameters
Basic Configuration
Unstructured API URL: API endpoint (default: http://localhost:8000/general/v0/general)
Strategy: Processing strategy (default: auto)
hi_res: High resolution processing
fast: Quick processing
ocr_only: OCR-focused processing
auto: Automatic selection
Encoding: Document encoding (default: utf-8)
OCR Options
OCR Languages: Languages to use for OCR, including:
English (eng)
Spanish (spa)
Mandarin Chinese (cmn)
Hindi (hin)
Arabic (ara)
Portuguese (por)
Bengali (ben)
Russian (rus)
Japanese (jpn)
And more...
Processing Options
Skip Infer Table Types: File types for which table extraction is skipped (default: ["pdf", "jpg", "png"])
Hi-Res Model Name: Model selection for hi_res strategy (default: detectron2_onnx)
chipper: Unstructured's in-house VDU model
detectron2_onnx: Facebook AI's fast object detection
yolox: Single-stage real-time detector
yolox_quantized: Optimized YOLOX version
Coordinates: Extract element coordinates (default: false)
Include Page Breaks: Include page break elements
XML Keep Tags: Preserve XML tags
Multi-Page Sections: Handle multi-page sections
Text Chunking Options
Chunking Strategy: Text chunking method (default: by_title)
None: No chunking
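by_title: Start a new chunk at each title (section heading) in the document
Combine Under N Chars: Combine small sections until the chunk reaches this many characters (effective minimum chunk size)
New After N Chars: Soft maximum chunk size; a new chunk is started once the current one reaches this length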
Max Characters: Hard maximum chunk size (default: 500)
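How the three size limits interact is easier to see in code. The sketch below is a simplified illustration, not Unstructured's actual chunking algorithm: the function name `chunk_sections`, the input shape (one string per titled section), and the exact merge rule are assumptions made for the example.

```python
def chunk_sections(sections, combine_under_n_chars, new_after_n_chars, max_characters):
    """Illustrative chunker (hypothetical): `sections` holds one string per titled section."""
    chunks = []
    current = ""
    for section in sections:
        # Hard maximum: an oversized section is cut into max_characters pieces.
        pieces = [section[i:i + max_characters]
                  for i in range(0, len(section), max_characters)]
        for piece in pieces:
            joined = f"{current}\n\n{piece}" if current else piece
            # Minimum: keep merging while the open chunk is still small
            # and the merged text stays within the hard maximum.
            if current and len(current) < combine_under_n_chars and len(joined) <= max_characters:
                current = joined
            else:
                if current:
                    chunks.append(current)
                current = piece
            # Soft maximum: close the chunk once it is long enough.
            if len(current) >= new_after_n_chars:
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks
```

In this sketch a short section such as a lone heading is merged into the following section, while no chunk ever exceeds `max_characters`.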
Metadata Options
Source ID Key: Key for document source identification (default: source)
Additional Metadata: Custom metadata as JSON
Omit Metadata Keys: Keys to exclude from metadata
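The three metadata options can be pictured as a small merge step. The following is a sketch under assumptions: the helper name `build_metadata` is hypothetical, and the order of operations (extracted metadata, then the source key, then additional metadata, then omissions) is a guess at reasonable behavior rather than the loader's documented logic.

```python
import json


def build_metadata(source, extracted, source_id_key="source",
                   additional_metadata="{}", omit_keys=()):
    """Assemble document metadata (hypothetical helper).

    `additional_metadata` is a JSON string, as in the Additional Metadata option.
    """
    metadata = dict(extracted)                         # metadata produced by processing
    metadata[source_id_key] = source                   # Source ID Key
    metadata.update(json.loads(additional_metadata))   # Additional Metadata
    for key in omit_keys:                              # Omit Metadata Keys
        metadata.pop(key, None)
    return metadata
```

For example, `build_metadata("a.pdf", {"page_number": 1}, additional_metadata='{"department": "legal"}', omit_keys=["page_number"])` keeps the source and department while dropping the page number.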
Supported File Types
Documents: .doc, .docx, .odt, .ppt, .pptx, .pdf
Spreadsheets: .xls, .xlsx
Text: .txt, .text, .md, .rtf
Web: .html, .htm
Email: .eml, .msg
Images: .jpg, .jpeg (Note: .png and .heic currently unsupported)
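A pre-check against the list above can help avoid sending unsupported files to the API. This is a minimal sketch: the names `is_supported` and `list_supported_files` are hypothetical, and scanning only the top level of the folder (no recursion) is an assumption, not documented loader behavior.

```python
from pathlib import Path

# Extensions taken from the Supported File Types list; .png and .heic are excluded.
SUPPORTED_EXTENSIONS = {
    ".doc", ".docx", ".odt", ".ppt", ".pptx", ".pdf",
    ".xls", ".xlsx",
    ".txt", ".text", ".md", ".rtf",
    ".html", ".htm",
    ".eml", ".msg",
    ".jpg", ".jpeg",
}


def is_supported(filename):
    """Return True if the file extension is in the supported list (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def list_supported_files(folder_path):
    """List supported files directly inside folder_path (assumed non-recursive)."""
    folder = Path(folder_path)
    return sorted(p for p in folder.iterdir() if p.is_file() and is_supported(p.name))
```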
Output Structure
Document Format
Each processed document includes:
pageContent: Extracted text content
metadata:
source: Document source identifier
Additional metadata from processing
Custom metadata (if specified)
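The document shape described above can be modeled as a small data class. This is only an illustration of the structure: the class name `Document` and the sample values are assumptions, while the field names `pageContent` and `metadata` follow the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """One processed element, mirroring the fields listed above."""
    pageContent: str                                          # extracted text content
    metadata: Dict[str, Any] = field(default_factory=dict)    # source, processing and custom metadata


# Hypothetical example values, for illustration only.
example = Document(
    pageContent="Quarterly results improved.",
    metadata={"source": "/path/to/documents/report.pdf", "department": "legal"},
)
```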
Usage Examples
Basic Configuration
{
  "folderPath": "/path/to/documents",
  "strategy": "auto",
  "encoding": "utf-8"
}
Advanced Processing
{
  "folderPath": "/path/to/documents",
  "strategy": "hi_res",
  "hiResModelName": "detectron2_onnx",
  "ocrLanguages": ["eng", "spa", "fra"],
  "chunkingStrategy": "by_title",
  "maxCharacters": 500,
  "coordinates": true,
  "metadata": {
    "source": "company_docs",
    "department": "legal"
  }
}
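To show how a configuration like the one above might translate into an API request, the sketch below converts the camelCase keys into snake_case form fields. It is an assumption-laden illustration: the helper `to_form_fields` is hypothetical, the snake_case names are assumed to match the Unstructured API's form parameters and should be checked against the API version in use, and treating `folderPath` and `metadata` as loader-only keys is likewise a guess.

```python
import re

# Assumed to be handled by the loader itself rather than sent to the API.
LOADER_ONLY_KEYS = {"folderPath", "metadata"}


def to_form_fields(config):
    """Convert camelCase loader options to snake_case form fields (hypothetical helper)."""
    fields = {}
    for key, value in config.items():
        if key in LOADER_ONLY_KEYS:
            continue
        # Insert "_" before each capital letter, then lowercase: hiResModelName -> hi_res_model_name
        name = re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()
        if isinstance(value, bool):
            # Form fields carry booleans as strings.
            value = "true" if value else "false"
        fields[name] = value
    return fields
```

Applied to the advanced example, this yields fields such as `hi_res_model_name`, `ocr_languages`, `chunking_strategy` and `max_characters`.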
Best Practices
Choose an appropriate strategy based on document quality and processing needs
Configure OCR languages based on document content
Adjust chunking parameters for optimal text segmentation
Use a hi-res model appropriate for your use case
Consider memory usage when processing large folders
Monitor API usage and response times
Handle potential API errors in your workflow
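One way to handle transient API errors is to retry with exponential backoff. The sketch below is a generic pattern rather than part of the loader: `call_with_retries` and its parameters are hypothetical, and `send` stands for whatever function performs the actual API request in your workflow.

```python
import time


def call_with_retries(send, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() up to `attempts` times, doubling the delay after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send()
        except Exception as error:  # in real code, catch only the errors worth retrying
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Injecting `sleep` makes the helper testable without real delays; in production the default `time.sleep` is used.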
Notes
Process multiple documents in batch
Supports various file formats
Memory-efficient processing
Automatic metadata handling
Flexible output formats
Error handling for API responses
Configurable processing options