Docling is available as a document converter in Haystack, enabling high-fidelity document processing in your Haystack pipelines.

Overview

The Docling Haystack integration provides:
  • Document conversion for Haystack pipelines
  • Support for multiple document formats (PDF, DOCX, PPTX, etc.)
  • High-fidelity table and layout extraction
  • Easy integration with existing Haystack workflows

Installation

pip install docling-haystack

Quick Start

Here’s a minimal example that runs the Docling converter on its own, outside a pipeline:
from docling_haystack.converter import DoclingConverter

# Create the converter
converter = DoclingConverter()

# Convert a document
result = converter.run(
    paths=["document.pdf"]
)

# Access the converted Haystack documents
for doc in result["documents"]:
    print(doc.content)
    print(doc.meta)

Building a RAG Pipeline

from docling_haystack.converter import DoclingConverter
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create components
document_store = InMemoryDocumentStore()
converter = DoclingConverter()
splitter = DocumentSplitter(split_length=500, split_overlap=50)
embedder = SentenceTransformersDocumentEmbedder()
writer = DocumentWriter(document_store=document_store)

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("converter", converter)
pipeline.add_component("splitter", splitter)
pipeline.add_component("embedder", embedder)
pipeline.add_component("writer", writer)

# Connect components
pipeline.connect("converter.documents", "splitter.documents")
pipeline.connect("splitter.documents", "embedder.documents")
pipeline.connect("embedder.documents", "writer.documents")

# Run pipeline
result = pipeline.run({
    "converter": {"paths": ["document.pdf"]}
})
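To make the `split_length=500, split_overlap=50` settings concrete, the sketch below mimics splitting with overlap in plain Python. It is an illustration only, not Haystack's implementation: the helper name `split_with_overlap` is hypothetical, and it assumes the text is split into word units.

```python
def split_with_overlap(words, split_length, split_overlap):
    """Return consecutive windows of `split_length` words, each sharing
    `split_overlap` words with the previous window."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + split_length])
        if start + split_length >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 1000-word input, used only to show the window arithmetic
words = [f"w{i}" for i in range(1000)]
chunks = split_with_overlap(words, split_length=500, split_overlap=50)
print(len(chunks))
print(chunks[0][-50:] == chunks[1][:50])  # neighbouring chunks share 50 words
```

Each window advances by `split_length - split_overlap` words, so a larger overlap produces more chunks for the same input.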

Advanced Configuration

Custom Conversion Options

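from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_haystack.converter import DoclingConverter

# Configure Docling options for PDF input
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

# Wrap the options in a Docling DocumentConverter
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Create the Haystack converter around the configured Docling converter
converter = DoclingConverter(converter=doc_converter)

result = converter.run(paths=["document.pdf"])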

Batch Processing

import glob

from docling_haystack.converter import DoclingConverter

converter = DoclingConverter()

# Process multiple files in one call
files = sorted(glob.glob("documents/*.pdf"))
result = converter.run(paths=files)

# The count is of Haystack documents produced, which can exceed the number of input files
print(f"Converted {len(result['documents'])} documents")
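Because the integration handles several formats, a batch run may need more than `*.pdf`. The following is a minimal standard-library sketch of collecting files by extension before handing them to the converter. The helper name `collect_files` and the extension set are illustrative assumptions, not part of docling-haystack.

```python
from pathlib import Path

# Illustrative extension set; adjust to the formats you actually need.
EXTENSIONS = {".pdf", ".docx", ".pptx", ".html"}


def collect_files(root, extensions=EXTENSIONS):
    """Return sorted file paths under `root` (searched recursively)
    whose suffix, compared case-insensitively, is in `extensions`."""
    root = Path(root)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


files = collect_files("documents")
print(f"Found {len(files)} files to convert")
```

The resulting list can be passed to the converter in place of the `glob.glob` result.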

Features

  • Pipeline Integration: Seamlessly integrates into Haystack pipelines
  • Multi-Format Support: Supports PDF, DOCX, PPTX, HTML, and more
  • Table Extraction: Accurately extracts table structures
  • OCR Support: Process scanned documents and images

Complete RAG Application

from docling_haystack.converter import DoclingConverter
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Initialize document store
document_store = InMemoryDocumentStore()

# Indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", DoclingConverter())
indexing_pipeline.add_component("writer", DocumentWriter(document_store))
indexing_pipeline.connect("converter.documents", "writer.documents")

# Index documents
indexing_pipeline.run({"converter": {"paths": ["document.pdf"]}})

# Query pipeline
template = """
Given the following documents, answer the question.

Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

query_pipeline = Pipeline()
query_pipeline.add_component("retriever", InMemoryBM25Retriever(document_store))
query_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
# OpenAIGenerator reads the API key from the OPENAI_API_KEY environment variable
query_pipeline.add_component("llm", OpenAIGenerator())

query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "llm.prompt")

# Query
response = query_pipeline.run({
    "retriever": {"query": "What is the main topic?"},
    "prompt_builder": {"question": "What is the main topic?"}
})

print(response["llm"]["replies"][0])
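Note that the question is passed twice, once to the retriever and once to the prompt builder. A small helper can keep the two in sync. The name `build_query_inputs` is hypothetical; the component names are the ones registered on `query_pipeline` above.

```python
def build_query_inputs(question):
    """Build the run() input dict so the retriever and the prompt
    builder always receive the same question text."""
    return {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
    }


inputs = build_query_inputs("What is the main topic?")
print(inputs)
```

With this in place, the query call could be written as `query_pipeline.run(build_query_inputs("What is the main topic?"))`.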

Use Cases

  • Document Indexing: Convert and index large document collections
  • RAG Applications: Build question-answering systems over documents
  • Content Extraction: Extract structured content from unstructured documents
  • Search Pipelines: Enable semantic search over document collections

Resources

  • Documentation: Official Haystack integration docs
  • GitHub: Source code and examples
  • Example Notebook: Complete RAG example
  • PyPI: Package repository

Next Steps