Microsoft Word

Microsoft Word is word processing software for creating and editing text documents.
This module provides a Word document loader, built on officeparser, that can:
Load Word documents
Extract text content
Split content into sections
Handle page numbering
Process metadata per section
Support multiple section formats
Handle various section separators
Inputs
Required Parameters
Word File: The Word file(s) to process (.doc, .docx)
Optional Parameters
Text Splitter: A text splitter to process the extracted content
Additional Metadata: JSON object with additional metadata
Omit Metadata Keys: Comma-separated list of metadata keys to omit
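The two metadata options can be illustrated with a minimal sketch. The helper name applyMetadataOptions is hypothetical, and the sketch assumes that additional metadata overrides default keys on conflict and that omitted keys are dropped after merging. The loader's actual precedence may differ.

```typescript
type Metadata = Record<string, unknown>;

// Hypothetical helper, not part of the loader's API.
// Merges default metadata with user-supplied metadata, then drops
// every key named in a comma-separated omit list.
function applyMetadataOptions(
  base: Metadata,
  additional: Metadata = {},
  omitKeys = ""
): Metadata {
  // "a, b ,c" -> Set { "a", "b", "c" }; empty entries are ignored.
  const omit = new Set(
    omitKeys
      .split(",")
      .map((key) => key.trim())
      .filter((key) => key.length > 0)
  );

  // Assumption: user-supplied values win when a key exists in both objects.
  const merged: Metadata = { ...base, ...additional };

  const kept: Metadata = {};
  for (const key of Object.keys(merged)) {
    if (!omit.has(key)) {
      kept[key] = merged[key];
    }
  }
  return kept;
}

const exampleMetadata = applyMetadataOptions(
  { documentType: "word", pageNumber: 1 },
  { project: "handbook" },
  "pageNumber, author"
);
console.log(exampleMetadata);
```

Listing a key that is not present (author in the call above) is harmless in this sketch; it is simply ignored.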
Outputs
Document: Array of document objects containing metadata and pageContent
Text: Concatenated string from pageContent of documents
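As a rough illustration of how the two outputs relate, the sketch below derives the Text output from the Document output. The type name WordDocument and the function documentsToText are hypothetical, and the newline separator is an assumption; the loader may join sections differently.

```typescript
// Hypothetical shape of one entry in the Document output.
interface WordDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Builds the Text output by concatenating pageContent in document order.
// Assumption: sections are joined with a single newline.
function documentsToText(docs: WordDocument[]): string {
  return docs.map((doc) => doc.pageContent).join("\n");
}

const exampleDocs: WordDocument[] = [
  { pageContent: "Introduction", metadata: { documentType: "word", pageNumber: 1 } },
  { pageContent: "Details", metadata: { documentType: "word", pageNumber: 2 } },
];
console.log(documentsToText(exampleDocs));
```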
Features
Text extraction
Section separation
Metadata handling
Error handling
Memory-efficient processing
Heuristic section detection
Content filtering
Section Detection Methods
Pattern Recognition
The loader attempts to identify sections using common patterns:
"Page X" markers
"Section X" markers
"Chapter X" markers
Numbered sections (e.g., "1. ", "2. ")
ALL CAPS headings
Long underscore separators
Long dash separators
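One way such pattern matching could work is sketched below. The regular expressions, the minimum lengths, and the names SECTION_MARKERS, isSectionMarker, and splitByMarkers are all assumptions made for illustration; they are not taken from the loader's source.

```typescript
// Hypothetical patterns, one per marker type listed above.
const SECTION_MARKERS: RegExp[] = [
  /^page\s+\d+\b/i, // "Page X"
  /^section\s+\d+\b/i, // "Section X"
  /^chapter\s+\d+\b/i, // "Chapter X"
  /^\d+\.\s+\S/, // numbered sections such as "1. Introduction"
  /^[A-Z][A-Z0-9 ]{3,}$/, // ALL CAPS headings (4+ characters, assumed)
  /^_{5,}$/, // long underscore separators (5+ characters, assumed)
  /^-{5,}$/, // long dash separators (5+ characters, assumed)
];

// True when the trimmed line matches any marker pattern.
function isSectionMarker(line: string): boolean {
  const trimmed = line.trim();
  return SECTION_MARKERS.some((pattern) => pattern.test(trimmed));
}

// Starts a new section at every marker line; the marker stays with
// the section it introduces.
function splitByMarkers(text: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    if (isSectionMarker(line) && current.length > 0) {
      sections.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) {
    sections.push(current.join("\n"));
  }
  return sections;
}

console.log(splitByMarkers("Chapter 1\nIntro text\nChapter 2\nMore text"));
```

Text without any marker comes back as a single section in this sketch, which is the signal to try the fallback mechanisms.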
Fallback Mechanisms
If pattern recognition fails:
Split by multiple newlines
Split by double newlines
Treat content as single section
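The fallback chain can be sketched as follows. The function name splitWithFallbacks is hypothetical, and the sketch assumes the steps are tried in the order listed, that "multiple newlines" means three or more, and that a split counts as successful only when it yields more than one section.

```typescript
// Hypothetical fallback chain, tried in the order listed above.
function splitWithFallbacks(text: string): string[] {
  // Trim each part and drop parts that are empty after trimming.
  const nonEmpty = (parts: string[]): string[] =>
    parts.map((part) => part.trim()).filter((part) => part.length > 0);

  // 1. Multiple newlines: assumed to mean three or more line breaks,
  //    optionally with whitespace between them.
  const byMultiple = nonEmpty(text.split(/\n\s*\n\s*\n/));
  if (byMultiple.length > 1) {
    return byMultiple;
  }

  // 2. Double newlines: a single blank line between blocks.
  const byDouble = nonEmpty(text.split(/\n\s*\n/));
  if (byDouble.length > 1) {
    return byDouble;
  }

  // 3. Treat the whole content as one section (or none if it is blank).
  return nonEmpty([text]);
}

console.log(splitWithFallbacks("First block\n\nSecond block"));
```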
Document Structure
Each document contains:
pageContent: Extracted text content from the section
metadata:
documentType: "word"
pageNumber: Sequential section number
Additional custom metadata
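For illustration, the structure above could be produced from a list of sections like this. The names SectionDocument and toDocuments are hypothetical, and the sketch assumes section numbers start at 1 and that custom metadata is merged last.

```typescript
// Hypothetical shape matching the structure described above.
interface SectionDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Wraps each section in a document object with sequential numbering.
// Assumptions: numbering starts at 1; custom metadata is merged last,
// so it can override the default keys.
function toDocuments(
  sections: string[],
  customMetadata: Record<string, unknown> = {}
): SectionDocument[] {
  return sections.map((section, index) => ({
    pageContent: section,
    metadata: {
      documentType: "word",
      pageNumber: index + 1,
      ...customMetadata,
    },
  }));
}

const sectionDocs = toDocuments(["Introduction", "Details"], { source: "report.docx" });
console.log(JSON.stringify(sectionDocs, null, 2));
```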
Content Processing
Empty sections are filtered out
Leading/trailing whitespace removed
Minimum content length validation
Reasonable section count validation
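A minimal sketch of these checks is shown below. The thresholds MIN_SECTION_LENGTH and MAX_SECTIONS, the function name cleanSections, and the choice to fall back to a single section when validation fails are all assumptions; the loader's actual limits are not documented here.

```typescript
// Hypothetical thresholds; the loader's real values are not documented here.
const MIN_SECTION_LENGTH = 10;
const MAX_SECTIONS = 1000;

// Trims sections, drops empty or too-short ones, and rejects splits
// that produce an unreasonable number of sections.
function cleanSections(sections: string[], fullText: string): string[] {
  const cleaned = sections
    .map((section) => section.trim())
    .filter((section) => section.length >= MIN_SECTION_LENGTH);

  // Assumption: if nothing usable remains, or the split looks implausible,
  // fall back to the whole text as one section.
  if (cleaned.length === 0 || cleaned.length > MAX_SECTIONS) {
    const whole = fullText.trim();
    return whole.length > 0 ? [whole] : [];
  }
  return cleaned;
}

console.log(cleanSections(["  ", " short ", "  long enough section  "], "original text"));
```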
Metadata Attributes
Default attributes include:
documentType: Type of document (string)
pageCount: Number of pages/sections (number)
Custom metadata from input
Notes
Uses officeparser for extraction
Handles various document formats
Intelligent section detection
Content validation
Memory-efficient processing
Error handling for invalid files
Flexible output formats
Robust fallback mechanisms