Microsoft Word
Microsoft Word is a word processor for creating and editing text documents. This module loads and processes Word documents using officeparser.
The loader can:
Load Word documents
Extract text content
Split content into sections
Handle page numbering
Process metadata per section
Support multiple section formats
Handle various section separators
Inputs
Word File: The Word file(s) to process (.doc, .docx)
Text Splitter: A text splitter to process the extracted content
Additional Metadata: JSON object with additional metadata
Omit Metadata Keys: Comma-separated list of metadata keys to omit
Outputs
Document: Array of document objects, each containing metadata and pageContent
Text: A single string formed by concatenating the pageContent of all documents
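The two optional inputs and the Text output can be illustrated with a minimal sketch. The names `LoadedDocument`, `parseOmitKeys`, `omitMetadataKeys`, and `toText` are hypothetical, and the newline used to join sections is an assumption; the loader's actual implementation may differ.

```typescript
// Hypothetical shape of one output document (field names follow the text above).
interface LoadedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Turn the comma-separated "Omit Metadata Keys" input into clean key names.
function parseOmitKeys(raw: string): string[] {
  return raw
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);
}

// Return copies of the documents with the listed metadata keys removed.
function omitMetadataKeys(docs: LoadedDocument[], omitKeys: string[]): LoadedDocument[] {
  return docs.map((doc) => {
    const metadata: Record<string, unknown> = { ...doc.metadata };
    for (const key of omitKeys) {
      delete metadata[key];
    }
    return { pageContent: doc.pageContent, metadata };
  });
}

// The "Text" output: every pageContent joined into one string.
// Assumption: sections are joined with a single newline.
function toText(docs: LoadedDocument[]): string {
  return docs.map((doc) => doc.pageContent).join("\n");
}
```

For example, `parseOmitKeys("pageNumber, source")` yields the two keys `pageNumber` and `source`, which `omitMetadataKeys` then strips from every document.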
Text extraction
Section separation
Metadata handling
Error handling
Memory-efficient processing
Heuristic section detection
Content filtering
The loader attempts to identify sections using common patterns:
"Page X" markers
"Section X" markers
"Chapter X" markers
Numbered sections (e.g., "1. ", "2. ")
ALL CAPS headings
Long underscore separators
Long dash separators
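The marker-based detection listed above could be sketched roughly as follows. The regular expressions, the five-character threshold for separators, and the function name `splitByMarkers` are illustrative assumptions, not the loader's actual patterns. It operates on text that has already been extracted from the Word file.

```typescript
// Heading-like markers: a matching line starts a new section and is kept.
// All patterns are illustrative assumptions.
const HEADING_PATTERNS: RegExp[] = [
  /^(Page|Section|Chapter)\s+\d+\b/i, // "Page 3", "Section 2", "Chapter 1"
  /^\d+\.\s+\S/, // numbered sections such as "1. Introduction"
  /^[A-Z][A-Z\s]{3,}$/, // ALL CAPS headings
];

// Separator lines: a long run of underscores or dashes. The line is dropped.
const SEPARATOR_PATTERN = /^(_{5,}|-{5,})$/;

function splitByMarkers(text: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];

  // Close the section being built, skipping it if it is empty.
  const flush = (): void => {
    const body = current.join("\n").trim();
    if (body.length > 0) {
      sections.push(body);
    }
    current = [];
  };

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (SEPARATOR_PATTERN.test(line)) {
      flush(); // separator ends the current section and is not kept
      continue;
    }
    if (HEADING_PATTERNS.some((pattern) => pattern.test(line))) {
      flush(); // heading begins a new section and stays with it
    }
    current.push(rawLine);
  }
  flush();
  return sections;
}
```

In this sketch a heading line stays with the section it opens, while a separator line is discarded. Whether the real loader keeps or drops marker lines is not stated in the text above.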
If pattern recognition fails:
Split by multiple newlines
Split by double newlines
Treat content as single section
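The fallback chain above can be sketched as a function that tries each strategy in order and moves on only when a split yields a single piece. The function name `splitWithFallback` and the reading of "multiple newlines" as three or more consecutive newlines are assumptions.

```typescript
function splitWithFallback(text: string): string[] {
  // Trim each piece and drop the empty ones.
  const clean = (parts: string[]): string[] =>
    parts.map((part) => part.trim()).filter((part) => part.length > 0);

  // 1. Runs of three or more newlines (blank lines may hold spaces or tabs).
  const byMultiple = clean(text.split(/(?:\r?\n[ \t]*){3,}/));
  if (byMultiple.length > 1) {
    return byMultiple;
  }

  // 2. Double newlines, i.e. at least one blank line between pieces.
  const byDouble = clean(text.split(/(?:\r?\n[ \t]*){2,}/));
  if (byDouble.length > 1) {
    return byDouble;
  }

  // 3. Nothing to split on: treat the whole content as a single section.
  return clean([text]);
}
```

Ordering the strategies from the strongest separator to the weakest means paragraph breaks are used as section boundaries only when no larger gap exists in the text.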
Each document contains:
pageContent: Extracted text content from the section
metadata:
documentType: "word"
pageNumber: Sequential section number
Additional custom metadata
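As an illustration of this structure, the sketch below turns a list of section strings into document objects. The names `SectionDocument` and `toDocuments` are hypothetical, and letting the built-in fields take precedence over custom metadata with the same key is an assumption.

```typescript
// Hypothetical document shape, using the field names described above.
interface SectionDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

function toDocuments(
  sections: string[],
  custom: Record<string, unknown> = {}
): SectionDocument[] {
  return sections.map((section, index) => ({
    pageContent: section,
    metadata: {
      ...custom, // additional custom metadata
      documentType: "word", // assumption: built-in fields override custom ones
      pageNumber: index + 1, // sequential section number, starting at 1
    },
  }));
}
```

Calling `toDocuments(["Intro", "Body"], { source: "report.docx" })` produces two documents whose `pageNumber` values are 1 and 2, each carrying the extra `source` field.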
Empty sections are filtered out
Leading/trailing whitespace removed
Minimum content length validation
Reasonable section count validation
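These checks might look like the following sketch. The thresholds and the function names `cleanSections` and `isReasonableSplit` are assumptions chosen for illustration; the text above does not state the actual limits or whether the length check applies per section.

```typescript
// Trim every section, then drop those that are empty or shorter than the minimum.
// Assumption: the minimum length applies to each section individually.
function cleanSections(sections: string[], minLength: number = 10): string[] {
  return sections
    .map((section) => section.trim())
    .filter((section) => section.length >= minLength);
}

// Accept a split only if it produced more than one section and not an
// implausibly large number of them. The upper bound is an assumed value.
function isReasonableSplit(sections: string[], maxSections: number = 1000): boolean {
  return sections.length > 1 && sections.length <= maxSections;
}
```

A section count check of this kind gives the loader a simple signal for rejecting a split that produced too few or too many sections.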
Default attributes include:
documentType: Type of document (string)
pageCount: Number of pages/sections (number)
Custom metadata from input
Uses officeparser for extraction
Handles various document formats
Intelligent section detection
Content validation
Memory-efficient processing
Error handling for invalid files
Flexible output formats
Robust fallback mechanisms