Microsoft Word

Microsoft Word is a word processing software for creating and editing text documents. This module provides functionality to load and process Word documents using officeparser.

This module provides a sophisticated Word document loader that can:

  • Load Word documents

  • Extract text content

  • Split content into sections

  • Handle page numbering

  • Process metadata per section

  • Support multiple section formats

  • Handle various section separators

Inputs

Required Parameters

  • Word File: The Word file(s) to process (.doc, .docx)

Optional Parameters

  • Text Splitter: A text splitter to process the extracted content

  • Additional Metadata: JSON object with additional metadata

  • Omit Metadata Keys: Comma-separated list of metadata keys to omit

Outputs

  • Document: Array of document objects containing metadata and pageContent

  • Text: Concatenated string from pageContent of documents

Features

  • Text extraction

  • Section separation

  • Metadata handling

  • Error handling

  • Memory-efficient processing

  • Heuristic section detection

  • Content filtering

Section Detection Methods

Pattern Recognition

The loader attempts to identify sections using common patterns:

  • "Page X" markers

  • "Section X" markers

  • "Chapter X" markers

  • Numbered sections (e.g., "1. ", "2. ")

  • ALL CAPS headings

  • Long underscore separators

  • Long dash separators

Fallback Mechanisms

If pattern recognition fails:

  1. Split by multiple newlines

  2. Split by double newlines

  3. Treat content as single section

Document Structure

Each document contains:

  • pageContent: Extracted text content from the section

  • metadata:

    • documentType: "word"

    • pageNumber: Sequential section number

    • Additional custom metadata

Content Processing

  • Empty sections are filtered out

  • Leading/trailing whitespace removed

  • Minimum content length validation

  • Reasonable section count validation

Metadata Attributes

Default attributes include:

  • documentType: Type of document (string)

  • pageCount: Number of pages/sections (number)

  • Custom metadata from input

Notes

  • Uses officeparser for extraction

  • Handles various document formats

  • Intelligent section detection

  • Content validation

  • Memory-efficient processing

  • Error handling for invalid files

  • Flexible output formats

  • Robust fallback mechanisms

Last updated