Process multiple file formats (PDF, DOCX, PPTX, HTML, images, etc.) with customized handling per format.

Overview

This example shows how to:
  • Convert a mixed list of file formats
  • Restrict allowed formats with an explicit whitelist
  • Override pipeline and backend settings per format
  • Export results to Markdown, JSON, and YAML

Basic Multi-Format Conversion

run_with_formats.py
from pathlib import Path
import json
import yaml
from docling.datamodel.base_models import InputFormat
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

input_paths = [
    Path("README.md"),
    Path("tests/data/html/wiki_duck.html"),
    Path("tests/data/docx/word_sample.docx"),
    Path("tests/data/pptx/powerpoint_sample.pptx"),
    Path("tests/data/2305.03393v1-pg9-img.png"),
    Path("tests/data/pdf/2206.01062.pdf"),
]

Configure Format-Specific Options

  1. Whitelist Formats: Specify which formats to process. Non-matching files are ignored.
  2. Override Per-Format Settings: Customize the pipeline and backend for specific formats.
  3. Convert All Documents: Process the mixed document list.

doc_converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.PPTX,
        InputFormat.ASCIIDOC,
        InputFormat.CSV,
        InputFormat.MD,
    ],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=StandardPdfPipeline,
            backend=PyPdfiumDocumentBackend
        ),
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline
        ),
    },
)

conv_results = doc_converter.convert_all(input_paths)
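
Files whose format is not listed in allowed_formats are not converted. Depending on the Docling version and the raises_on_error argument of convert_all, they may be skipped or reported as a conversion error.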
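
Before exporting, it can help to separate results that converted cleanly from those that did not. The following is a minimal sketch, not part of the original example. It assumes each result exposes a status enum whose member names include SUCCESS and PARTIAL_SUCCESS, as Docling's ConversionStatus does. The helper partition_results and the stand-in FakeStatus objects are hypothetical and exist only so the sketch runs without Docling installed.

```python
from enum import Enum
from types import SimpleNamespace

# Status names treated as usable output.
# Assumption: these mirror member names of Docling's ConversionStatus.
OK_STATUS_NAMES = {"SUCCESS", "PARTIAL_SUCCESS"}


def partition_results(results):
    """Split an iterable of results into (usable, unusable) lists by status name."""
    usable, unusable = [], []
    for res in results:
        target = usable if res.status.name in OK_STATUS_NAMES else unusable
        target.append(res)
    return usable, unusable


# Stand-in enum and result objects (hypothetical, for demonstration only).
class FakeStatus(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


demo = [
    SimpleNamespace(name="a.pdf", status=FakeStatus.SUCCESS),
    SimpleNamespace(name="b.docx", status=FakeStatus.FAILURE),
    SimpleNamespace(name="c.html", status=FakeStatus.PARTIAL_SUCCESS),
]

# Pass an iterator to mimic a lazily produced result stream.
usable, unusable = partition_results(iter(demo))
print([r.name for r in usable])    # ['a.pdf', 'c.html']
print([r.name for r in unusable])  # ['b.docx']
```

With Docling, the same function could be applied to the iterator returned by convert_all, probably with raises_on_error=False so that failures are returned rather than raised. That iterator can be consumed only once, so export from the returned lists instead of iterating conv_results a second time.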

Export Results

# Write all exports into a local "scratch" directory
output_dir = Path("scratch")
output_dir.mkdir(parents=True, exist_ok=True)

for res in conv_results:
    # Name each output after the input file, without its extension
    doc_filename = res.input.file.stem

    # Export to Markdown (explicit UTF-8 avoids platform-dependent encodings)
    with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
        fp.write(res.document.export_to_markdown())

    # Export to JSON
    with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
        fp.write(json.dumps(res.document.export_to_dict()))

    # Export to YAML
    with (output_dir / f"{doc_filename}.yaml").open("w", encoding="utf-8") as fp:
        fp.write(yaml.safe_dump(res.document.export_to_dict()))

    print(f"Converted {res.input.file.name}")
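
The export loop names each output after the input file's stem, so two inputs that share a stem (for example report.pdf and report.docx) would write to the same output files. The sketch below shows one way to avoid that. The helper unique_stem and the sample paths are hypothetical and not part of Docling.

```python
from pathlib import Path


def unique_stem(path: Path, used: set) -> str:
    """Return path.stem, appending _1, _2, ... until the name is unused."""
    stem = path.stem
    candidate, n = stem, 1
    while candidate in used:
        candidate = f"{stem}_{n}"
        n += 1
    used.add(candidate)
    return candidate


used = set()
inputs = [
    Path("docs/report.pdf"),
    Path("drafts/report.docx"),
    Path("notes/report_1.txt"),
    Path("README.md"),
]
names = [unique_stem(p, used) for p in inputs]
print(names)  # ['report', 'report_1', 'report_1_1', 'README']
```

In the export loop, doc_filename could then be computed as unique_stem(res.input.file, used), with used created once before the loop.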

Format Options

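from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# One entry of the format_options mapping passed to DocumentConverter:
# the key is the input format, the value bundles its pipeline and backend.
format_options = {
    InputFormat.PDF: PdfFormatOption(
        pipeline_cls=StandardPdfPipeline,
        backend=PyPdfiumDocumentBackend,
    ),
}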

Supported Formats

This example enables the following formats:
  • PDF: Native and scanned PDFs
  • DOCX: Microsoft Word documents
  • PPTX: PowerPoint presentations
  • HTML: Web pages
  • Images: PNG, JPG, TIFF
  • Markdown: MD files
  • AsciiDoc: AsciiDoc files
  • CSV: Comma-separated values
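
As a rough pre-check before conversion, a mixed input list can be grouped by file extension against the formats listed above. This is only an illustrative sketch: Docling performs its own format detection, and the SUFFIX_TO_FORMAT table and group_by_format helper are assumptions made for this example, not Docling's mapping.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical extension table for illustration; not Docling's own mapping.
SUFFIX_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".html": "HTML",
    ".htm": "HTML",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".tif": "Images",
    ".tiff": "Images",
    ".md": "Markdown",
    ".adoc": "AsciiDoc",
    ".asciidoc": "AsciiDoc",
    ".csv": "CSV",
}


def group_by_format(paths):
    """Group file names by format label, using the lowercase file suffix."""
    groups = defaultdict(list)
    for p in map(Path, paths):
        label = SUFFIX_TO_FORMAT.get(p.suffix.lower(), "unrecognized")
        groups[label].append(p.name)
    return dict(groups)


groups = group_by_format(
    [
        "README.md",
        "tests/data/html/wiki_duck.html",
        "tests/data/docx/word_sample.docx",
        "tests/data/2305.03393v1-pg9-img.png",
        "tests/data/pdf/2206.01062.pdf",
        "notes.xyz",
    ]
)
for label, names in groups.items():
    print(f"{label}: {names}")
```

Anything that lands under "unrecognized" is a candidate for removal from the input list before calling the converter.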

Default vs Custom Configuration

# Default: no explicit configuration; default options apply to every format
doc_converter = DocumentConverter()

# Custom: restrict the allowed formats and override options only where needed
doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        # "..." is a placeholder for your own PDF options
        InputFormat.PDF: PdfFormatOption(...),
    },
)
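
One way to picture the custom configuration is as a per-format lookup: each allowed format uses its override when one is given and falls back to a default otherwise. The sketch below is a conceptual model under that assumption, not Docling's implementation. The function resolve_options and the string placeholders standing in for option objects are hypothetical.

```python
def resolve_options(allowed_formats, overrides, default_for):
    """For each allowed format, pick the override if present, else the default."""
    return {fmt: overrides.get(fmt, default_for(fmt)) for fmt in allowed_formats}


# Strings stand in for format keys and option objects in this sketch.
resolved = resolve_options(
    allowed_formats=["PDF", "DOCX"],
    overrides={"PDF": "custom-pdf-option"},
    default_for=lambda fmt: f"default-{fmt.lower()}-option",
)
print(resolved)  # {'PDF': 'custom-pdf-option', 'DOCX': 'default-docx-option'}
```

In this model, DOCX is allowed but has no override, so it keeps its default handling, which matches the custom example above where only the PDF entry is supplied.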