Microsoft Word

Microsoft Word is a word processor for creating and editing text documents. This module loads and processes Word documents using officeparser.

The Word document loader can:

Load Word documents
Extract text content
Split content into sections
Handle page numbering
Process metadata per section
Support multiple section formats
Handle various section separators

Inputs

Required Parameters

Word File: The Word file(s) to process (.doc, .docx)

Optional Parameters

Text Splitter: A text splitter to process the extracted content
Additional Metadata: JSON object with additional metadata
Omit Metadata Keys: Comma-separated list of metadata keys to omit
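
The two metadata options above can be pictured as a merge followed by a filter. The sketch below is illustrative only: `applyMetadataOptions` is a hypothetical helper, not part of the loader, and it assumes that Additional Metadata may arrive either as a JSON string or as an already-parsed object, and that custom keys override loader-supplied keys on conflict.

```typescript
type Metadata = Record<string, unknown>;

// Hypothetical helper (not the loader's real API): merges loader metadata with
// user-supplied metadata, then drops keys named in the comma-separated omit list.
function applyMetadataOptions(
  base: Metadata,
  additional: Metadata | string = {},
  omitKeys: string = ""
): Metadata {
  // Assumption: the JSON object may be passed as a string or as a parsed object.
  const extra: Metadata =
    typeof additional === "string"
      ? (additional.trim() ? JSON.parse(additional) : {})
      : additional;

  // "a, b ,c" -> Set {"a", "b", "c"}; empty entries are ignored.
  const omit = new Set(
    omitKeys
      .split(",")
      .map((key) => key.trim())
      .filter((key) => key.length > 0)
  );

  // Assumption: custom metadata wins when a key exists in both objects.
  const merged: Metadata = { ...base, ...extra };
  const result: Metadata = {};
  for (const key of Object.keys(merged)) {
    if (!omit.has(key)) result[key] = merged[key];
  }
  return result;
}
```

For example, calling it with `{ documentType: "word", pageNumber: 1 }`, the string `'{"source":"hr"}'`, and the omit list `"pageNumber"` would keep `documentType` and `source` and drop `pageNumber`.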

Outputs

Document: Array of document objects containing metadata and pageContent
Text: Concatenated string from pageContent of documents
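
The relationship between the two outputs can be sketched as follows. `LoadedDocument` and `documentsToText` are hypothetical names, and the newline separator is an assumption, since the page does not say how the `pageContent` values are joined.

```typescript
// Hypothetical shape of one entry in the Document output.
interface LoadedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Builds the Text output by concatenating pageContent values.
// Assumption: a newline separator; the loader's actual separator is not documented here.
function documentsToText(docs: LoadedDocument[], separator: string = "\n"): string {
  return docs.map((doc) => doc.pageContent).join(separator);
}
```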

Features

Text extraction
Section separation
Metadata handling
Error handling
Memory-efficient processing
Heuristic section detection
Content filtering

Section Detection Methods

Pattern Recognition

The loader attempts to identify sections using common patterns:

"Page X" markers
"Section X" markers
"Chapter X" markers
Numbered sections (e.g., "1. ", "2. ")
ALL CAPS headings
Long underscore separators
Long dash separators
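
The marker styles listed above can be approximated with regular expressions. This is a minimal sketch, not the loader's actual implementation: the exact patterns, the five-character minimum for separator lines, and the names `HEADING_PATTERNS`, `SEPARATOR_PATTERNS`, and `splitOnMarkers` are all assumptions made for illustration.

```typescript
// Hypothetical patterns: a line matching one of these starts a new section.
const HEADING_PATTERNS: RegExp[] = [
  /^page\s+\d+\b/i, // "Page 3"
  /^section\s+\d+\b/i, // "Section 2"
  /^chapter\s+\d+\b/i, // "Chapter 7"
  /^\d+\.\s+\S/, // "1. Overview"
  /^[A-Z][A-Z0-9 ,:&'-]{3,}$/, // ALL CAPS heading such as "SUMMARY"
];

// Hypothetical patterns: a line matching one of these ends a section and is dropped.
const SEPARATOR_PATTERNS: RegExp[] = [
  /^_{5,}$/, // long underscore rule (assumed minimum of 5 characters)
  /^-{5,}$/, // long dash rule (assumed minimum of 5 characters)
];

function splitOnMarkers(text: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];

  // Pushes the accumulated lines as one section, skipping empty ones.
  const flush = (): void => {
    const body = current.join("\n").trim();
    if (body.length > 0) sections.push(body);
    current = [];
  };

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (SEPARATOR_PATTERNS.some((re) => re.test(line))) {
      flush(); // separator lines are not kept in the output
    } else {
      if (HEADING_PATTERNS.some((re) => re.test(line))) flush();
      current.push(rawLine); // heading lines stay with the section they open
    }
  }
  flush();
  return sections;
}
```

In this sketch a heading line is kept as the first line of the section it opens, while underscore and dash rules are discarded; the real loader may treat them differently.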

Fallback Mechanisms

If pattern recognition fails:

Split by multiple newlines
Split by double newlines
Treat content as single section
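
The fallback chain above might look like the sketch below. It assumes the steps are tried in the listed order, that "multiple newlines" means three or more consecutive line breaks, and that a step counts as successful only when it yields more than one section. `splitWithFallbacks` is a hypothetical name, not the loader's API.

```typescript
// Hypothetical fallback chain, used only when no section markers were found.
function splitWithFallbacks(text: string): string[] {
  // Trim each part and drop the empty ones.
  const clean = (parts: string[]): string[] =>
    parts.map((part) => part.trim()).filter((part) => part.length > 0);

  // 1. Assumption: "multiple newlines" means three or more consecutive line breaks.
  const byMultiple = clean(text.split(/(?:\r?\n){3,}/));
  if (byMultiple.length > 1) return byMultiple;

  // 2. Double newlines, i.e. ordinary paragraph breaks.
  const byDouble = clean(text.split(/(?:\r?\n){2}/));
  if (byDouble.length > 1) return byDouble;

  // 3. Last resort: the whole text becomes a single section.
  const whole = text.trim();
  return whole.length > 0 ? [whole] : [];
}
```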

Document Structure

Each document contains:

pageContent: Extracted text content from the section
metadata:
- documentType: "word"
- pageNumber: Sequential section number
- Additional custom metadata
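
As a rough illustration of this structure, the sketch below turns a list of section strings into document objects. The `WordDocument` interface and `toDocuments` function are hypothetical names, and letting custom metadata override the built-in keys is an assumption rather than documented behavior.

```typescript
// Hypothetical type mirroring the structure described above.
interface WordDocument {
  pageContent: string;
  // Always contains documentType and pageNumber; custom keys sit alongside them.
  metadata: Record<string, unknown>;
}

// Wraps each section in a document object with a 1-based sequential pageNumber.
function toDocuments(
  sections: string[],
  custom: Record<string, unknown> = {}
): WordDocument[] {
  return sections.map((content, index): WordDocument => ({
    pageContent: content,
    // Assumption: custom metadata is spread last, so it wins on key conflicts.
    metadata: { documentType: "word", pageNumber: index + 1, ...custom },
  }));
}
```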

Content Processing

Empty sections are filtered out
Leading/trailing whitespace removed
Minimum content length validation
Reasonable section count validation
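
These checks can be sketched as a cleaning step plus a plausibility test. The thresholds below (20 characters per section, at most 500 sections) are placeholders chosen for illustration, since the page does not state the actual limits, and `cleanSections` and `isPlausibleSplit` are hypothetical names.

```typescript
// Trims every section and drops those that end up empty.
function cleanSections(raw: string[]): string[] {
  return raw.map((section) => section.trim()).filter((section) => section.length > 0);
}

// Decides whether a candidate split looks sensible.
// Assumption: minLength and maxSections are illustrative defaults, not the loader's values.
function isPlausibleSplit(
  sections: string[],
  minLength: number = 20,
  maxSections: number = 500
): boolean {
  // A split into fewer than two sections, or into too many, is not useful.
  if (sections.length < 2 || sections.length > maxSections) return false;
  // Every section must carry at least minLength characters of content.
  return sections.every((section) => section.length >= minLength);
}
```

A caller could run `cleanSections` on the output of a splitting strategy and move on to the next strategy whenever `isPlausibleSplit` returns false.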

Metadata Attributes

Default attributes include:

documentType: Type of document (string)
pageCount: Number of pages/sections (number)
Custom metadata from input

Notes

Uses officeparser for extraction
Handles various document formats
Intelligent section detection
Content validation
Memory-efficient processing
Error handling for invalid files
Flexible output formats
Robust fallback mechanisms

