Docling vs. Graphlit: When Open-Source Extraction Falls Short

Docling is IBM Research's open-source document parsing toolkit that's gained significant attention since its 2024 release. It promises to convert PDFs into structured Markdown using AI models for layout analysis and table recognition.

The appeal is obvious: it is free and open source, runs locally, and integrates with LangChain and LlamaIndex. For developers experimenting with RAG applications, it's an attractive starting point.

But there's a gap between GitHub stars and production readiness. Docling has real limitations that become apparent when you move beyond simple test documents to real-world enterprise content.

This comparison explains what Docling does well, where it struggles, and why production AI applications typically need more robust extraction — plus the semantic infrastructure that Graphlit provides.

TL;DR — Quick Comparison
What Docling Promises
Where Docling Struggles
What Graphlit Provides
When Docling Might Work
When You Need Production Infrastructure
Extraction Quality Comparison
The Real Cost of "Free"

TL;DR — Quick Comparison

Capability	Docling	Graphlit
Type	Open-source Python library	Managed semantic infrastructure platform
PDF Extraction	Basic — struggles with complex documents	Multiple backends (Azure AI, Claude, Reducto)
Scanned Documents	Limited OCR capabilities	Enterprise-grade OCR via Azure AI
Complex Layouts	Often fails on intricate designs	Handled by production extractors
Table Extraction	TableFormer model — inconsistent on complex tables	Best-in-class via Claude or Reducto
Handwritten Content	Not supported	Supported via Azure AI
Reliability	Known hanging issues on some PDFs	Production-grade with error handling
Output Format	Markdown, JSON	Markdown with automatic downstream processing
Vector Embeddings	Not included	Automatic on ingestion
Entity Extraction	Not included	Automatic Schema.org entities
Knowledge Graphs	Not included	Per-user knowledge graphs
Semantic Search	Not included	Hybrid vector + keyword + graph search
RAG Conversations	Not included	Built-in streaming conversations
Support	GitHub issues	Commercial support available
Pricing	Free (you manage infrastructure)	Usage-based, includes full infrastructure

What Docling Promises

Docling markets itself as an efficient document parsing solution:

Layout Analysis

Uses a layout model trained on the DocLayNet dataset to understand document structure: headers, paragraphs, lists, figures.

Table Recognition

TableFormer model attempts to extract table structures and convert them to Markdown.

Local Processing

Runs entirely on your hardware — no data leaves your environment.

Framework Integration

Connects with LangChain and LlamaIndex for RAG applications.

Multiple Formats

Supports PDF, DOCX, PPTX, HTML, images, and AsciiDoc.

For simple, clean, digitally created PDFs, Docling can produce reasonable results. The problem is that real-world documents are rarely simple or clean.

Where Docling Struggles

Users have reported significant limitations that affect production use:

Scanned Documents and OCR

Docling's OCR capabilities are limited. Scanned documents, photos of documents, or PDFs with image-based content often produce poor results or fail entirely.

"While Docling excels at converting documents into markdown while preserving layout and formatting, it struggles with complex tasks such as parsing scanned documents, handwritten content, and images."

Complex Layouts

Documents with intricate designs, multi-column layouts, sidebars, or non-standard formatting frequently confuse Docling's layout analysis.

Hanging and Timeouts

There are documented issues where Docling hangs indefinitely on certain PDFs, even with timeout configurations:

"The conversion process hangs indefinitely... despite configuring the converter to disable OCR and table structure processing, and setting a document timeout of 120 seconds."

Inconsistent Table Extraction

While TableFormer works on simple tables, complex tables with merged cells, nested structures, or unusual formatting often produce garbled output.

No Error Recovery

When Docling fails, it often fails silently or hangs. Production systems need graceful error handling, retries, and fallback strategies.
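
One common way to contain a hang is to run each conversion in its own process and kill that process when a deadline passes. The sketch below is a minimal illustration under stated assumptions, not a Docling feature: `run_isolated`, `convert_pdf`, and `DOCLING_SCRIPT` are hypothetical names, and the child script uses only the Docling calls shown in this article's integration example.

```python
import os
import subprocess
import sys

# Hypothetical child script. It relies only on the Docling calls that appear
# in this article: DocumentConverter, convert, export_to_markdown.
DOCLING_SCRIPT = """
import sys
from docling.document_converter import DocumentConverter
result = DocumentConverter().convert(sys.argv[1])
sys.stdout.write(result.document.export_to_markdown())
"""


def run_isolated(script: str, args: list[str], timeout_s: float) -> tuple[str, str]:
    """Run `script` in a fresh interpreter and enforce a hard timeout.

    Returns (status, payload) where status is "ok", "error", or "timeout".
    """
    env = {**os.environ, "PYTHONIOENCODING": "utf-8"}  # child writes UTF-8
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            text=True,
            encoding="utf-8",
            env=env,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    if proc.returncode != 0:
        return ("error", proc.stderr)
    return ("ok", proc.stdout)


def convert_pdf(path: str, timeout_s: float = 120.0) -> tuple[str, str]:
    """Convert one PDF to Markdown without letting a hang block the caller."""
    return run_isolated(DOCLING_SCRIPT, [path], timeout_s)
```

The trade-off is that every call pays the cost of starting a new interpreter and reloading models, so a real pipeline would more likely keep a small pool of worker processes and recycle any worker that misses its deadline.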

Resource Management

Running AI models locally requires careful resource management. Without proper infrastructure, memory leaks and processing bottlenecks are common.

What Graphlit Provides

Graphlit takes a different approach: production-grade extraction backends plus complete semantic infrastructure.

Multiple Extraction Backends

Choose the right tool for each document type:

Azure AI Document Intelligence: Enterprise-grade OCR, 275+ languages, handles scanned documents
Claude/Anthropic: Best-in-class for complex layouts and tables
Reducto: Specialized for structured documents
Deepseek: Cost-effective for high volume

Production Reliability

Automatic retries with exponential backoff
Graceful error handling
Timeout management
Processing status tracking
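
For readers unfamiliar with the first item, here is a minimal sketch of retrying with exponential backoff as a general technique. It is not Graphlit's internal implementation, which this article does not describe; `retry`, `backoff_delays`, and all parameter names are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 30.0) -> list[float]:
    """Upper bound on the wait before each retry: base_s * 2**n, capped at cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(attempts - 1)]


def retry(
    func: Callable[[], T],
    attempts: int = 4,
    base_s: float = 1.0,
    cap_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base_s, cap_s)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Random wait up to the bound, so parallel workers do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so tests can substitute a stub for real waiting. A production version would usually retry only on errors known to be transient rather than on every exception.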

Everything After Extraction

Graphlit doesn't stop at extraction:

Automatic embedding for vector search
Entity extraction (people, organizations, events)
Knowledge graphs connecting entities across documents
Hybrid search (vector + keyword + graph)
RAG conversations with streaming
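
To make "hybrid search" concrete: one widely used way to merge ranked lists from different retrievers is reciprocal rank fusion (RRF). This article does not say how Graphlit combines its vector, keyword, and graph results, so treat the sketch below as an illustration of the general idea only. The function name and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['b', 'a', 'd', 'c']
```

RRF needs only ranks, not comparable scores, which is why it is a popular choice when the underlying retrievers score results on different scales.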

30+ Data Connectors

Beyond PDF uploads:

Slack, Discord, Teams, Email
GitHub, Linear, Jira
Google Drive, Dropbox, SharePoint
RSS feeds, podcasts, YouTube

When Docling Might Work

Docling can be reasonable for:

Experimentation: Learning RAG concepts, prototyping ideas
Simple PDFs: Clean, digitally created documents with standard layouts
Privacy requirements: When data absolutely cannot leave your network
Cost sensitivity: When "free" is the only option (but see The Real Cost of "Free" below)

If you're a developer experimenting with document processing on simple test cases, Docling can help you understand the problem space.

When You Need Production Infrastructure

Move beyond Docling when:

You Have Real Documents

Enterprise documents are messy — scanned contracts, handwritten notes, complex financial reports, legacy PDFs. Production extractors handle this; Docling often doesn't.

Reliability Matters

If your application can't afford to hang indefinitely or fail silently, you need production infrastructure with proper error handling, monitoring, and fallbacks.
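
As a rough illustration of what a fallback strategy involves, the sketch below tries a list of extractors in order and treats empty output as a failure, so a silent failure does not pass as success. The extractors are plain callables supplied by the caller; `extract_with_fallback` and its return shape are hypothetical and not part of Docling or Graphlit.

```python
from typing import Callable, Optional

Extractor = Callable[[str], str]  # takes a file path, returns Markdown


def extract_with_fallback(
    path: str,
    extractors: list[tuple[str, Extractor]],
) -> tuple[Optional[str], Optional[str], list[str]]:
    """Try each (name, extractor) in order.

    Returns (winning_name, text, errors). Both winning_name and text are
    None if every extractor raised or produced empty output.
    """
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text, errors
        # No exception but nothing usable: record it as a silent failure.
        errors.append(f"{name}: empty output")
    return None, None, errors
```

The returned `errors` list is the part that feeds monitoring: it records which backends failed and why, even when a later one succeeds.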

You Need More Than Extraction

Extraction is step one. Search, entity extraction, knowledge graphs, and conversations require significant additional infrastructure that Docling doesn't provide.

You Have Multiple Data Sources

Real knowledge bases include Slack conversations, emails, and GitHub issues, not just document files. Docling only parses documents you hand it; Graphlit connects 30+ sources.

You Value Your Time

Building production document processing infrastructure is a multi-month engineering project. Graphlit provides it out of the box.

Extraction Quality Comparison

We've benchmarked PDF extraction across multiple services. For complex documents with tables:

Docling

Inconsistent table structure restoration
Struggles with merged cells and complex formatting
Layout analysis fails on non-standard documents
OCR quality significantly below enterprise alternatives

Graphlit with Claude (LLM Mode)

Most accurate table extraction available
Handles complex layouts and visual elements
Understands document context, not just structure
Consistent quality across document types

Graphlit with Azure AI (Default)

Enterprise-grade OCR (275+ languages)
Reliable table and layout detection
Handles scanned documents
Production-tested at scale

The quality gap is significant. For production applications, extraction accuracy directly impacts AI response quality.

The Real Cost of "Free"

Docling is free to download. But "free" has hidden costs:

Infrastructure Costs

Running AI models locally requires:

GPU compute (or slow CPU processing)
Memory management
Storage for models and outputs
Scaling infrastructure as volume grows

Engineering Time

Building production document processing means:

Error handling and retries
Monitoring and alerting
Queue management for async processing
Fallback strategies when extraction fails

The Extraction Gap

Docling gives you (sometimes unreliable) extraction. You still need:

Vector database setup and management
Embedding pipeline
Entity extraction
Search infrastructure
RAG implementation

Opportunity Cost

Every hour spent debugging Docling hangs is an hour not spent building your actual application.

Total Cost Calculation

For a team building a RAG application:

Docling "free" approach:

Engineering time: 2-4 months building infrastructure
GPU/compute costs: $200-500/month
Vector database: $70-200/month
Embedding API: $50-100/month
Ongoing maintenance: Significant
Plus: Quality issues requiring manual intervention

Graphlit:

Starter plan: $49/month
Setup time: Hours, not months
Infrastructure: Included
Maintenance: Zero
Quality: Production-grade
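
The recurring figures above can be totaled with a few lines of arithmetic. This sketch only adds up the numbers listed in this section. It leaves out engineering time and maintenance, which have no dollar figure here, and it treats the Starter plan price as a floor, since Graphlit pricing is usage-based.

```python
# Monthly figures in USD as listed above: (low, high).
DOCLING_MONTHLY = {
    "GPU/compute": (200, 500),
    "Vector database": (70, 200),
    "Embedding API": (50, 100),
}
GRAPHLIT_STARTER_MONTHLY = 49


def monthly_range(items: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of each cost range."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


low, high = monthly_range(DOCLING_MONTHLY)
print(f"Self-hosted stack: ${low}-${high}/month, ${low * 12}-${high * 12}/year")
print(f"Graphlit Starter:  ${GRAPHLIT_STARTER_MONTHLY}/month, "
      f"${GRAPHLIT_STARTER_MONTHLY * 12}/year")
```

On these figures the self-hosted stack comes to $320 to $800 per month, or $3,840 to $9,600 per year, before any engineering time is counted.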

Integration Example

Docling: Basic Extraction (When It Works)

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# May hang, may fail, may produce poor results
result = converter.convert("document.pdf")

# If it works, you get Markdown
markdown = result.document.export_to_markdown()

# Now you need to:
# 1. Handle errors and timeouts
# 2. Build embedding pipeline
# 3. Set up vector database
# 4. Create entity extraction
# 5. Build search infrastructure
# 6. Implement RAG conversations
# 7. Handle all the edge cases Docling misses

Graphlit: Production Infrastructure

import { Graphlit, Types } from 'graphlit-client';

const client = new Graphlit();

// Reliable extraction with Azure AI (default) or Claude (for complex docs)
const result = await client.ingestUri(
    "https://example.com/complex-report.pdf",
    "Annual Report"
);

// Already complete:
// - Extracted with production-grade OCR
// - Embedded for vector search
// - Entities extracted
// - Knowledge graph updated
// - Search indexed
// - Error handled automatically

// Query immediately
const contents = await client.queryContents({
    search: "revenue growth"
});

// RAG conversation ready
// (conversationId and specificationId are assumed to have been created earlier; not shown)
const response = await client.promptConversation(
    "Summarize the key financial highlights",
    conversationId,
    { id: specificationId }
);

Summary

Docling is an interesting open-source project for experimentation. It can work on simple documents and helps developers understand document processing concepts.

But for production applications, Docling's limitations become blockers:

Unreliable extraction quality
Hanging and timeout issues
Limited scanned document support
Missing infrastructure (embeddings, entities, search, conversations)

Graphlit provides production-grade extraction through proven backends (Azure AI, Claude, Reducto) plus complete semantic infrastructure. The extraction works reliably, and everything you need to build AI applications is included.

The choice is clear: experiment with Docling, build with Graphlit.

Explore Graphlit Features:

Document Processing — Production extraction backends
Building Knowledge Graphs — Automatic entity extraction
Complete Guide to Search — Hybrid semantic search
PDF Extraction Comparison — Benchmark results

Open-source is great for learning. Production requires infrastructure that works.