Microsoft PowerPoint
Microsoft PowerPoint is a presentation program for creating and displaying slide shows. This module loads and processes PowerPoint files, using officeparser to extract their text.
The loader can:
Load PowerPoint presentations
Extract text from slides
Split content into individual slides
Handle slide numbering
Process metadata per slide
Support multiple slide formats
Handle various slide separators
Inputs
PowerPoint File: The PowerPoint file(s) to process (.ppt, .pptx)
Text Splitter: A text splitter to process the extracted content
Additional Metadata: JSON object with additional metadata
Omit Metadata Keys: Comma-separated list of metadata keys to omit
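As a rough illustration of how the two metadata inputs might combine, the sketch below merges an Additional Metadata object into a slide's default metadata and then drops the keys named in Omit Metadata Keys. The helper name `applyMetadataOptions`, the `MetadataMap` type, and the order of operations (merge first, omit second) are assumptions made for this example, not confirmed behavior of the loader.

```typescript
type MetadataMap = Record<string, unknown>;

// Hypothetical helper: merge extra metadata, then remove the omitted keys.
function applyMetadataOptions(
  base: MetadataMap,
  additional: MetadataMap,
  omitKeys: string
): MetadataMap {
  // Parse the comma-separated list, ignoring blanks and stray whitespace.
  const omit = omitKeys
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);

  // Assumption: additional metadata is layered on top of the defaults.
  const merged: MetadataMap = { ...base, ...additional };

  const result: MetadataMap = {};
  for (const key of Object.keys(merged)) {
    if (omit.indexOf(key) === -1) {
      result[key] = merged[key];
    }
  }
  return result;
}

const exampleMetadata = applyMetadataOptions(
  { slideNumber: 1, documentType: "powerpoint" },
  { author: "Jane", team: "Sales" },
  "team, documentType"
);
console.log(exampleMetadata);
```

Here `team` and `documentType` are removed, leaving only `slideNumber` and `author`.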
Outputs
Document: Array of document objects containing metadata and pageContent
Text: Concatenated string from pageContent of documents
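To make the difference between the two outputs concrete, here is a small sketch. The `OutputDocument` shape mirrors the fields described on this page; the function name `toTextOutput` is hypothetical, and joining with a newline is an assumption, since the separator used for the Text output is not stated here.

```typescript
interface OutputDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Document output: the array itself.
// Text output: the pageContent values joined into one string.
// The "\n" separator is an assumption for illustration.
function toTextOutput(docs: OutputDocument[]): string {
  return docs.map((doc) => doc.pageContent).join("\n");
}

const outputDocs: OutputDocument[] = [
  {
    pageContent: "Quarterly results",
    metadata: { slideNumber: 1, documentType: "powerpoint" },
  },
  {
    pageContent: "Next steps",
    metadata: { slideNumber: 2, documentType: "powerpoint" },
  },
];

const outputText = toTextOutput(outputDocs);
console.log(outputText);
```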
Features
Text extraction
Slide separation
Metadata handling
Error handling
Memory-efficient processing
Heuristic slide detection
Content filtering
The loader attempts to identify slides using common patterns:
"Slide X" markers
"Page X" markers
"X/Y" page numbers
Underscore separators
Dash separators
Multiple newlines
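The patterns above could be approximated with a list of regular expressions tried in order. This is only a sketch: the expressions, the name `splitIntoSlides`, and the rule "accept the first pattern that yields more than one non-empty part" are assumptions for illustration, not the loader's actual implementation. The input is assumed to be plain text already extracted from the presentation.

```typescript
// Separator patterns loosely matching the markers listed above.
// The exact expressions are illustrative assumptions.
const slideSeparators: RegExp[] = [
  /^[ \t]*Slide[ \t]+\d+[ \t]*:?/im, // "Slide X" markers
  /^[ \t]*Page[ \t]+\d+[ \t]*:?/im, // "Page X" markers
  /^[ \t]*\d+[ \t]*\/[ \t]*\d+[ \t]*$/m, // "X/Y" page numbers on their own line
  /^[ \t]*_{3,}[ \t]*$/m, // underscore separators
  /^[ \t]*-{3,}[ \t]*$/m, // dash separators
  /\n[ \t]*\n[ \t]*\n/, // multiple newlines (two or more blank lines)
];

// Try each pattern in turn; return the first split that produces
// more than one non-empty part, or an empty array if none does.
function splitIntoSlides(text: string): string[] {
  for (let i = 0; i < slideSeparators.length; i++) {
    const parts = text
      .split(slideSeparators[i])
      .map((part) => part.trim())
      .filter((part) => part.length > 0);
    if (parts.length > 1) {
      return parts;
    }
  }
  return [];
}

const sampleDeck = "Slide 1\nWelcome\nSlide 2\nAgenda\nSlide 3\nQuestions";
const detectedSlides = splitIntoSlides(sampleDeck);
console.log(detectedSlides);
```

Returning an empty array when nothing matches leaves the decision about what to do next to the caller.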
If pattern recognition fails, the loader falls back in order:
Split by double newlines
Treat the content as a single slide
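One way to picture this fallback chain is the sketch below. The pattern stage is passed in as a function so the example stays self-contained; `PatternDetector`, `splitWithFallback`, and the assumption that the single-slide step applies only when the double-newline split also yields fewer than two parts are all hypothetical.

```typescript
// Hypothetical signature for the pattern stage; returns [] when nothing matched.
type PatternDetector = (text: string) => string[];

function splitWithFallback(
  text: string,
  detectByPattern: PatternDetector
): string[] {
  // Stage 1: pattern recognition.
  const byPattern = detectByPattern(text);
  if (byPattern.length > 1) {
    return byPattern;
  }

  // Stage 2: split on blank lines (double newlines).
  const byParagraph = text
    .split(/\n[ \t]*\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  if (byParagraph.length > 1) {
    return byParagraph;
  }

  // Stage 3: treat the whole content as a single slide.
  const whole = text.trim();
  return whole.length > 0 ? [whole] : [];
}

// A detector that never matches, to exercise the fallbacks.
const noPatterns: PatternDetector = () => [];

const fallbackA = splitWithFallback("Title\n\nFirst point\n\nSecond point", noPatterns);
const fallbackB = splitWithFallback("One block of text", noPatterns);
console.log(fallbackA, fallbackB);
```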
Each document contains:
pageContent: Extracted text content from the slide
metadata:
slideNumber: Sequential slide number
documentType: "powerpoint"
Additional custom metadata
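The structure described above can be expressed as a type plus a small builder. This is a sketch under stated assumptions: the names `SlideDocument` and `toSlideDocuments` are hypothetical, and letting the default fields take precedence over custom metadata with the same key is a choice made for this example only.

```typescript
interface SlideDocument {
  pageContent: string;
  metadata: {
    slideNumber: number;
    documentType: string;
    [key: string]: unknown; // additional custom metadata
  };
}

// Build one document per slide, numbering slides sequentially from 1.
function toSlideDocuments(
  slides: string[],
  extra: Record<string, unknown> = {}
): SlideDocument[] {
  return slides.map(
    (content, index): SlideDocument => ({
      pageContent: content,
      metadata: {
        ...extra, // assumption: defaults below override clashing custom keys
        slideNumber: index + 1,
        documentType: "powerpoint",
      },
    })
  );
}

const structured = toSlideDocuments(["Welcome", "Agenda"], {
  source: "deck.pptx",
});
console.log(JSON.stringify(structured, null, 2));
```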
Empty slides are filtered out
Leading/trailing whitespace removed
Minimum content length validation
Reasonable slide count validation
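A minimal sketch of these filtering steps follows. The thresholds `MIN_SLIDE_LENGTH` and `MAX_SLIDE_COUNT` and the function names are assumptions chosen for illustration; this page does not state the values the loader actually uses.

```typescript
// Illustrative thresholds only; the real values are not documented here.
const MIN_SLIDE_LENGTH = 3;
const MAX_SLIDE_COUNT = 500;

// Trim each slide and drop those that are empty or too short.
function cleanSlides(rawSlides: string[]): string[] {
  return rawSlides
    .map((slide) => slide.trim())
    .filter((slide) => slide.length >= MIN_SLIDE_LENGTH);
}

// A split that yields no slides, or an implausibly large number,
// probably means the separator guess was wrong.
function isReasonableSlideCount(count: number): boolean {
  return count > 0 && count <= MAX_SLIDE_COUNT;
}

const filteredSlides = cleanSlides(["  Intro  ", "", "   ", "-", "Closing remarks"]);
console.log(filteredSlides, isReasonableSlideCount(filteredSlides.length));
```

A caller could use `isReasonableSlideCount` to decide whether to accept a split or try a different strategy.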
Default attributes include:
slideNumber: Slide number (number)
documentType: Type of document (string)
Custom metadata from input
Uses officeparser for extraction
Handles various slide formats
Intelligent slide detection
Content validation
Memory-efficient processing
Error handling for invalid files
Flexible output formats
Robust fallback mechanisms