Overview
The DOCX backend (MsWordDocumentBackend) parses Microsoft Word documents (.docx files) and converts them directly to DoclingDocument format. It preserves document structure, formatting, and embedded content without requiring ML-based analysis.
Features
- Complete structure preservation - Headings, paragraphs, lists, tables
- Rich formatting support - Bold, italic, underline, strikethrough, superscript, subscript
- Hyperlinks and cross-references - Preserves internal and external links
- Table extraction - Full table structure with merged cells
- Image extraction - Embedded pictures and diagrams
- Equation support - Converts Office Math (OMML) to LaTeX
- Textbox content - Extracts text from textboxes and shapes
- Comments - Preserves document comments
- Header and footer - Extracts header/footer content
- List numbering - Maintains numbered and bulleted lists
Usage
Basic Conversion
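A minimal sketch of a basic conversion, assuming Docling is installed and `DocumentConverter` is used as the entry point. The helper name `convert_docx_to_markdown` is illustrative and not part of Docling.

```python
from pathlib import Path


def convert_docx_to_markdown(docx_path: str) -> str:
    """Convert a .docx file and return its Markdown export."""
    # Imported inside the function so this module still loads in
    # environments where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    # The input format is inferred from the file, so no DOCX-specific
    # configuration is needed for the default case.
    result = converter.convert(Path(docx_path))
    return result.document.export_to_markdown()
```

Calling `convert_docx_to_markdown("report.docx")` (a placeholder path) would return the document as a Markdown string.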
With Format Options
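A hedged sketch of configuring the converter explicitly for Word input. It assumes the `WordFormatOption`, `SimplePipeline`, and `InputFormat` names exposed by current Docling releases; the helper name `build_docx_converter` is hypothetical.

```python
def build_docx_converter():
    """Return a DocumentConverter restricted to DOCX input."""
    # Lazy imports keep this module importable without docling installed.
    from docling.backend.msword_backend import MsWordDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, WordFormatOption
    from docling.pipeline.simple_pipeline import SimplePipeline

    return DocumentConverter(
        # Reject any input that is not a Word document.
        allowed_formats=[InputFormat.DOCX],
        format_options={
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline,    # declarative pipeline, no ML models
                backend=MsWordDocumentBackend,  # the backend described on this page
            )
        },
    )
```

The returned converter is used the same way as the default one: `build_docx_converter().convert("report.docx")`.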
Supported Elements
Text and Formatting
Paragraphs and Headings
The backend automatically detects:
- Heading levels (H1-H9) based on paragraph styles
- Title and subtitle styles
- Normal paragraphs and body text
- Numbered headings (preserves numbering)
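To illustrate the idea of style-based detection, here is a small standalone sketch that classifies a paragraph from its Word style name. It is not the backend's actual code; the function names and the returned labels are assumptions made for this example.

```python
import re
from typing import Optional

# Matches built-in style names such as "Heading 1" through "Heading 9".
_HEADING_RE = re.compile(r"^heading\s*([1-9])$", re.IGNORECASE)


def heading_level(style_name: str) -> Optional[int]:
    """Return 1-9 for a 'Heading N' style name, otherwise None."""
    match = _HEADING_RE.match(style_name.strip())
    return int(match.group(1)) if match else None


def classify_paragraph(style_name: str) -> str:
    """Map a style name to a coarse label (labels are illustrative)."""
    name = style_name.strip().lower()
    if name in ("title", "subtitle"):
        return name
    level = heading_level(style_name)
    if level is not None:
        return f"heading-{level}"
    # Anything else is treated as body text.
    return "paragraph"
```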
Text Formatting
Supported inline formatting:
- Bold (<w:b>)
- Italic (<w:i>)
- Underline (<w:u>)
- Strikethrough (<w:strike>)
- Subscript and superscript (<w:vertAlign>)
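The elements listed above live in a run's `<w:rPr>` block. The following standalone sketch reads them with the standard library; it only illustrates the markup and is not how the backend is implemented. The function name and the result keys are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _toggle_on(rpr, tag):
    """A toggle such as <w:b/> is on unless w:val explicitly disables it."""
    el = rpr.find(W + tag)
    if el is None:
        return False
    return el.get(W + "val", "true") not in ("false", "0", "off")


def run_formatting(run_xml: str) -> dict:
    """Return the inline formatting flags of a single <w:r> element."""
    fmt = {"bold": False, "italic": False, "underline": False,
           "strikethrough": False, "script": "baseline"}
    rpr = ET.fromstring(run_xml).find(W + "rPr")
    if rpr is None:
        return fmt
    fmt["bold"] = _toggle_on(rpr, "b")
    fmt["italic"] = _toggle_on(rpr, "i")
    fmt["strikethrough"] = _toggle_on(rpr, "strike")
    # Underline carries a style value; "none" switches it off.
    u = rpr.find(W + "u")
    fmt["underline"] = u is not None and u.get(W + "val", "single") != "none"
    va = rpr.find(W + "vertAlign")
    if va is not None:
        val = va.get(W + "val")
        if val == "superscript":
            fmt["script"] = "super"
        elif val == "subscript":
            fmt["script"] = "sub"
    return fmt
```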
Hyperlinks
Internal and external hyperlinks are preserved.
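In WordprocessingML, an external link refers to a relationship ID whose target is stored in the package's relationships part, while an internal link names a bookmark through `w:anchor`. The sketch below shows that distinction using only the standard library. The `rels` mapping stands in for the parsed relationships file, and the function name is an assumption for this example.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def extract_hyperlinks(paragraph_xml: str, rels: dict) -> list:
    """Return (text, target) pairs for every <w:hyperlink> in a paragraph."""
    links = []
    for link in ET.fromstring(paragraph_xml).iter(W + "hyperlink"):
        # The visible link text is the concatenation of its <w:t> runs.
        text = "".join(t.text or "" for t in link.iter(W + "t"))
        rel_id = link.get(R + "id")
        anchor = link.get(W + "anchor")
        if rel_id is not None:
            target = rels.get(rel_id)   # external: resolved via relationships
        elif anchor is not None:
            target = "#" + anchor       # internal: points at a bookmark
        else:
            target = None
        links.append((text, target))
    return links
```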
Lists
The backend fully supports Word’s list structures:
- Bulleted lists - Unordered lists with various bullet styles
- Numbered lists - Ordered lists with automatic numbering
- Multi-level lists - Nested list hierarchies
- Mixed lists - Combination of numbered and bulleted items
Tables
Complete table extraction with:
- Cell content and formatting
- Merged cells (rowspan/colspan)
- Header row detection
- Nested table support
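Word encodes horizontal merges with `w:gridSpan` and vertical merges with `w:vMerge` (a `restart` cell followed by continuation cells in the rows below). The sketch below turns that encoding into explicit rowspan/colspan values. `RawCell` and `resolve_spans` are hypothetical names, and the input is a simplified stand-in for parsed table rows rather than the backend's own data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RawCell:
    text: str
    grid_span: int = 1              # value of w:gridSpan
    v_merge: Optional[str] = None   # "restart", "continue", or None


def resolve_spans(rows: List[List[RawCell]]) -> List[Tuple[str, int, int, int, int]]:
    """Return (text, row, col, rowspan, colspan) for each anchor cell."""
    resolved = []       # mutable [text, row, col, rowspan, colspan] entries
    open_merges = {}    # start column -> index of the cell being extended
    for r, row in enumerate(rows):
        col = 0
        touched = set()
        for cell in row:
            if cell.v_merge == "continue" and col in open_merges:
                # Continuation cell: extend the anchor above, emit nothing.
                resolved[open_merges[col]][3] += 1
                touched.add(col)
            else:
                resolved.append([cell.text, r, col, 1, cell.grid_span])
                if cell.v_merge == "restart":
                    open_merges[col] = len(resolved) - 1
                    touched.add(col)
            col += cell.grid_span
        # A vertical merge ends as soon as a row does not continue it.
        for c in list(open_merges):
            if c not in touched:
                del open_merges[c]
    return [tuple(entry) for entry in resolved]
```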
Images and Diagrams
Extracts embedded images:
- Inline pictures
- Floating images
- DrawingML shapes (requires LibreOffice)
- VML graphics
Equations
Office Math ML (OMML) equations are converted to LaTeX.
Textboxes
Content from textboxes and shapes is extracted:
- Modern Word textboxes (<w:txbxContent>)
- Legacy VML textboxes
- DrawingML shape text
DrawingML Support
For complex DrawingML elements (charts, diagrams, SmartArt), Docling can use LibreOffice for conversion.
Setup
Comments
Document comments are extracted and linked to their annotated paragraphs.
Header and Footer
Header and footer content is extracted as furniture-layer content.
Advanced Features
Numbered Headings
For Word documents with numbered headings (e.g., “1.2.3 Section Title”), the numbering is preserved in the output.
List Counters
The backend tracks list counters across the document:
- Separate counters per list ID and level
- Automatic reset on new sequences
- Support for custom start numbers
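The counter behaviour described above can be sketched as a small class keyed by list ID and level. This is an illustration of the bookkeeping, not the backend's implementation; `ListCounters` and `advance` are hypothetical names, and resetting deeper levels when a shallower item appears is one plausible reading of "automatic reset on new sequences".

```python
class ListCounters:
    """Tracks the current number for each (list ID, level) pair."""

    def __init__(self):
        self._counts = {}  # (num_id, level) -> last number issued

    def advance(self, num_id: int, level: int, start: int = 1) -> int:
        """Return the number for the next item at this list ID and level."""
        key = (num_id, level)
        if key in self._counts:
            self._counts[key] += 1
        else:
            # First item of a sequence begins at the custom start value.
            self._counts[key] = start
        # A new item at this level restarts every deeper level of the
        # same list, so nested numbering begins again beneath it.
        for other_id, other_level in list(self._counts):
            if other_id == num_id and other_level > level:
                del self._counts[(other_id, other_level)]
        return self._counts[key]
```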
Style Detection
Automatic detection of Word styles:
- Built-in styles (Heading 1-9, Title, Normal, etc.)
- Custom user styles
- Style inheritance
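Style inheritance in Word is expressed through `w:basedOn` references between style definitions. As a rough sketch of how a custom style could be traced back to a built-in one, the function below follows that chain. The `based_on` mapping stands in for data read from `styles.xml`, and all names here are assumptions for the example.

```python
from typing import Dict, Optional, Set


def resolve_base_style(
    style_id: str,
    based_on: Dict[str, str],
    builtin: Set[str],
) -> Optional[str]:
    """Follow basedOn links until a built-in style is reached."""
    seen = set()
    current: Optional[str] = style_id
    # The `seen` set guards against cyclic basedOn chains in broken files.
    while current is not None and current not in seen:
        if current in builtin:
            return current
        seen.add(current)
        current = based_on.get(current)
    return None
```

For instance, a custom "ChapterTitle" style based on "Heading1" would resolve to "Heading1" and could then be treated as a level-1 heading.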
Limitations
Performance
- Speed: Very fast for declarative conversion (no ML models)
- Memory: Low memory footprint
- Concurrency: Thread-safe per document instance
Troubleshooting
Missing images
Cause: DrawingML shapes require LibreOffice
Solution: Install LibreOffice so that DrawingML shapes can be converted
Incorrect list numbering
Cause: Custom numbering formats or a broken document
Solution: Check the source document in Word and ensure the numbering is valid
Missing text from textboxes
Cause: Nested or complex textbox structures
Workaround: The backend attempts multiple textbox formats, but some edge cases may not be extracted
Equation rendering issues
Cause: Complex OMML structures
Note: Most standard equations convert correctly to LaTeX
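To give a sense of what a "standard" structure looks like, the sketch below converts two common OMML constructs, fractions (`m:f`) and superscripts (`m:sSup`), to LaTeX. It is a deliberately tiny illustration written for this page, not the backend's converter, and anything beyond these two constructs simply falls through as concatenated text.

```python
import xml.etree.ElementTree as ET

M = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"


def omml_to_latex(node) -> str:
    """Recursively render a small subset of OMML as LaTeX."""
    tag = node.tag
    if tag == M + "t":
        return node.text or ""
    if tag == M + "f":
        num = omml_to_latex(node.find(M + "num"))
        den = omml_to_latex(node.find(M + "den"))
        return "\\frac{" + num + "}{" + den + "}"
    if tag == M + "sSup":
        base = omml_to_latex(node.find(M + "e"))
        sup = omml_to_latex(node.find(M + "sup"))
        return base + "^{" + sup + "}"
    # Property elements (m:rPr, m:fPr, m:ctrlPr, ...) carry no content.
    if tag.endswith("Pr"):
        return ""
    # Containers such as m:oMath, m:r, m:num: join the rendered children.
    return "".join(omml_to_latex(child) for child in node)
```

Nested or less common structures (matrices, n-ary operators, accents) need many more cases than this, which is where conversion problems tend to appear.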
Export Formats
After conversion, the document can be exported to various formats.
See Also
- Backends Overview - Backend architecture
- PPTX Backend - PowerPoint processing
- XLSX Backend - Excel processing
- DocumentConverter - Main conversion API