
Document Processing: PDFs, Word, and Text Extraction

Master document processing with Graphlit. Deep dive into PDF extraction, OCR, table detection, Word documents, and multimodal extraction with vision models.

Document processing—extracting text from PDFs, Word files, and scanned images—is foundational for building knowledge bases. Graphlit offers two extraction strategies: traditional OCR (fast, cheap) and vision models (slower, higher quality). This guide helps you choose the right approach.

What You'll Learn

  • OCR vs vision model extraction
  • Handling complex PDFs (tables, multi-column, scanned)
  • Word, Excel, PowerPoint processing
  • Document metadata extraction
  • Quality optimization techniques
  • Production patterns for large document libraries

Part 1: The Two Extraction Strategies

Traditional OCR (Fast & Cheap)

import { Graphlit } from 'graphlit-client';
import { FilePreparationServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// OCR-based workflow
const ocrWorkflow = await graphlit.createWorkflow({
  name: "OCR Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document  // Traditional OCR
      }
    }]
  }
});

// Ingest with OCR
const content = await graphlit.ingestUri(
  'https://example.com/simple-document.pdf',
  'Simple PDF',
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflow.createWorkflow.id }
);

Good for:

  • Text-based PDFs
  • Simple layouts
  • Fast batch processing
  • Cost-sensitive applications

Not good for:

  • Scanned documents
  • Complex tables
  • Multi-column layouts
  • Handwritten text
  • Forms with checkboxes

Vision Models (High Quality)

// Vision-based workflow
const visionWorkflow = await graphlit.createWorkflow({
  name: "Vision Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // GPT-4 Vision
      }
    }]
  }
});

// Ingest with vision
const content = await graphlit.ingestUri(
  'https://example.com/complex-document.pdf',
  'Complex PDF',
  undefined,
  undefined,
  undefined,
  { id: visionWorkflow.createWorkflow.id }
);

Good for:

  • Scanned PDFs
  • Complex tables
  • Multi-column layouts
  • Forms and checkboxes
  • Handwritten text (to some degree)
  • Mixed text and images

Not good for:

  • Simple text PDFs (overkill)
  • Large-scale batch processing (expensive)
  • Real-time applications (slower)

Part 2: Document Types

PDF Documents

Text-based PDFs:

// Use OCR for text-based PDFs
const textPdf = await graphlit.ingestUri(
  'https://example.com/text-document.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflowId }
);

Scanned PDFs:

// Use vision for scanned PDFs
const scannedPdf = await graphlit.ingestUri(
  'https://example.com/scanned-document.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Check extracted text quality:

const content = await graphlit.getContent(contentId);

console.log('Pages extracted:', content.content.pages?.length);
console.log('First page text:', content.content.pages?.[0]?.chunks?.[0]?.text);

if (content.content.document) {
  console.log('Title:', content.content.document.title);
  console.log('Author:', content.content.document.author);
  console.log('Page count:', content.content.document.pageCount);
}

Word Documents

// Word docs use same workflow
const wordDoc = await graphlit.ingestUri(
  'https://example.com/document.docx',
  'Word Document',
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflowId }  // OCR works fine for Word
);

What gets extracted:

  • Text content
  • Headers and footers
  • Tables (converted to text)
  • Embedded images (optional)
  • Document metadata (author, created date, etc.)

Excel Spreadsheets

const excel = await graphlit.ingestUri(
  'https://example.com/data.xlsx',
  'Excel Sheet'
);

What gets extracted:

  • Cell values (converted to text)
  • Sheet names
  • Formulas (as text)
  • Table structures preserved as markdown

PowerPoint Presentations

const pptx = await graphlit.ingestUri(
  'https://example.com/presentation.pptx',
  'Presentation'
);

What gets extracted:

  • Slide text
  • Speaker notes
  • Slide order
  • Embedded images (optional)

Part 3: Handling Complex Documents

Tables

Problem: Tables often get mangled in extraction.

Solution: Use vision models:

const tableWorkflow = await graphlit.createWorkflow({
  name: "Table Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // Vision preserves tables
      }
    }]
  }
});

Result: Tables extracted as structured markdown:

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | $100k | $120k | $150k | $180k |
| Widget B | $80k | $90k | $95k | $110k |
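Because tables come back as markdown, downstream code can turn them into structured records. A hypothetical helper (not part of the SDK) that handles simple, well-formed tables like the one above:

```typescript
// Parse a simple markdown table (header row, separator row, data rows)
// into an array of objects keyed by column name.
function parseMarkdownTable(md: string): Record<string, string>[] {
  const lines = md.trim().split('\n').map(l => l.trim()).filter(Boolean);
  if (lines.length < 3) return [];  // need header, separator, and data

  // Split "| a | b |" into ['a', 'b'], dropping the empty edge cells
  const parseRow = (line: string) =>
    line.split('|').slice(1, -1).map(cell => cell.trim());

  const headers = parseRow(lines[0]);
  // lines[1] is the |---|---| separator; data starts at index 2
  return lines.slice(2).map(line => {
    const cells = parseRow(line);
    const row: Record<string, string> = {};
    headers.forEach((h, i) => { row[h] = cells[i] ?? ''; });
    return row;
  });
}
```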

Multi-Column Layouts

Problem: OCR reads left-to-right, breaking column flow.

Solution: Vision models understand layout:

// Vision maintains reading order across columns
const multiCol = await graphlit.ingestUri(
  'https://example.com/newspaper.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Scanned Documents

Problem: OCR fails on low-quality scans.

Solution: Vision models are more robust:

// Works with scanned, faxed, photocopied docs
const scanned = await graphlit.ingestUri(
  'https://example.com/scanned-contract.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Tip: For very poor quality, pre-process with image enhancement before ingestion.

Forms with Checkboxes

Problem: OCR doesn't detect checkbox states.

Solution: Vision models see visual elements:

// Extracts "☑ Agree to terms" vs "☐ Agree to terms"
const form = await graphlit.ingestUri(
  'https://example.com/application-form.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Part 4: Metadata Extraction

Document Metadata

const content = await graphlit.getContent(contentId);

if (content.content.document) {
  console.log('Title:', content.content.document.title);
  console.log('Author:', content.content.document.author);
  console.log('Subject:', content.content.document.subject);
  console.log('Keywords:', content.content.document.keywords);
  console.log('Created:', content.content.document.createdDate);
  console.log('Modified:', content.content.document.modifiedDate);
  console.log('Page count:', content.content.document.pageCount);
  console.log('Language:', content.content.document.language);
}
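Scanned or machine-generated PDFs often leave these fields null, so display code should tolerate gaps. A hypothetical formatter, assuming only the metadata field names shown above:

```typescript
// Build a short display string from document metadata, tolerating
// missing fields (scanned PDFs often carry no title or author).
interface DocumentSummary {
  title?: string | null;
  author?: string | null;
  pageCount?: number | null;
}

function describeDocument(doc: DocumentSummary, fallbackName: string): string {
  const parts = [doc.title || fallbackName];
  if (doc.author) parts.push(`by ${doc.author}`);
  if (doc.pageCount) parts.push(`(${doc.pageCount} pages)`);
  return parts.join(' ');
}
```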

Filter by Document Properties

import { ContentTypes, FileTypes } from 'graphlit-client/dist/generated/graphql-types';

// Find PDFs by author
const authorDocs = await graphlit.queryContents({
  search: 'machine learning',
  filter: {
    types: [ContentTypes.File],
    fileTypes: [FileTypes.Document],
    // Note: author filtering requires custom implementation
  }
});

Part 5: Quality Optimization

Compare Extraction Quality

// Test both methods on same document
const ocrResult = await graphlit.ingestUri(
  testUrl,
  'OCR Test',
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflowId }
);

const visionResult = await graphlit.ingestUri(
  testUrl,
  'Vision Test',
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

// Wait for both
await Promise.all([
  waitForContent(ocrResult.ingestUri.id),
  waitForContent(visionResult.ingestUri.id)
]);

// Compare
const ocrContent = await graphlit.getContent(ocrResult.ingestUri.id);
const visionContent = await graphlit.getContent(visionResult.ingestUri.id);

console.log('OCR extracted:', ocrContent.content.pages?.length, 'pages');
console.log('Vision extracted:', visionContent.content.pages?.length, 'pages');
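The `waitForContent` helper used here isn't part of the SDK. One way to sketch it is a generic poller; the interval and timeout defaults below are arbitrary assumptions:

```typescript
// Generic poller: retry `check` until it resolves true or we time out.
async function pollUntil(
  check: () => Promise<boolean>,
  intervalMs = 2000,
  timeoutMs = 300000
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out after ${timeoutMs}ms`);
}
```

A `waitForContent(id)` built on this would call `graphlit.getContent(id)` inside `check` and return true once the content's state indicates processing has finished; the exact state field and values depend on the Graphlit schema.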

Hybrid Approach

// Use OCR first, fall back to vision if quality is poor
async function smartIngest(uri: string) {
  // Try OCR first
  const ocrContent = await graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: ocrWorkflowId });
  await waitForContent(ocrContent.ingestUri.id);
  
  const content = await graphlit.getContent(ocrContent.ingestUri.id);
  
  // Check quality (e.g., text length)
  const textLength = content.content.pages?.[0]?.chunks?.[0]?.text?.length || 0;
  
  if (textLength < 100) {
    // Quality seems poor, delete and retry with vision
    console.log('OCR quality poor, retrying with vision...');
    await graphlit.deleteContent(ocrContent.ingestUri.id);
    
    return graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: visionWorkflowId });
  }
  
  return ocrContent;
}

Part 6: Production Patterns

Batch Document Processing

async function processBatch(urls: string[]) {
  const results = [];
  
  for (const url of urls) {
    try {
      // Route by document type: PDFs may be scanned, so default them
      // to vision; other formats (Word, etc.) are text-based, so OCR is enough
      const isPdf = url.toLowerCase().endsWith('.pdf');
      const workflowId = isPdf ? visionWorkflowId : ocrWorkflowId;
      
      const content = await graphlit.ingestUri(url, undefined, undefined, undefined, undefined, { id: workflowId });
      
      results.push({
        url,
        contentId: content.ingestUri.id,
        status: 'ingested'
      });
      
      // Rate limit
      await new Promise(r => setTimeout(r, 1000));
    } catch (error: any) {
      results.push({
        url,
        status: 'failed',
        error: error.message
      });
    }
  }
  
  return results;
}

Document Quality Monitoring

async function monitorQuality(contentId: string) {
  const content = await graphlit.getContent(contentId);
  
  const quality = {
    pagesExtracted: content.content.pages?.length || 0,
    avgChunkLength: 0,
    hasMetadata: !!content.content.document,
    warnings: [] as string[]
  };
  
  // Calculate average chunk length
  if (content.content.pages) {
    const totalLength = content.content.pages.reduce((sum, page) => {
      return sum + (page.chunks?.reduce((chunkSum, chunk) => chunkSum + (chunk.text?.length || 0), 0) || 0);
    }, 0);
    const totalChunks = content.content.pages.reduce((sum, page) => sum + (page.chunks?.length || 0), 0);
    quality.avgChunkLength = totalChunks > 0 ? totalLength / totalChunks : 0;
  }
  
  // Add warnings
  if (quality.pagesExtracted === 0) {
    quality.warnings.push('No pages extracted');
  }
  if (quality.avgChunkLength < 100) {
    quality.warnings.push('Chunks suspiciously short - possible extraction issue');
  }
  
  return quality;
}

Common Issues & Solutions

Issue: Gibberish Text Extracted

Problem: Extracted text is garbled: "T∂e qu√ck br©wn f©x..."

Solution: Switch to vision model:

// OCR failed on this PDF encoding
// Use vision instead
{ id: visionWorkflowId }
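You can catch this automatically with a cheap heuristic: measure what share of the extracted text is ordinary printable ASCII. A sketch; the 0.9 threshold is an arbitrary assumption to tune against your own documents:

```typescript
// Heuristic: flag text whose share of plain printable characters is
// suspiciously low (garbled encodings show up as symbol soup).
function looksGarbled(text: string, threshold = 0.9): boolean {
  if (text.length === 0) return true;  // empty extraction is also a failure
  const plain = text.match(/[a-zA-Z0-9\s.,;:'"!?()\-]/g)?.length ?? 0;
  return plain / text.length < threshold;
}
```

Note this will false-positive on legitimately non-English or symbol-heavy documents, so treat it as a trigger for re-extraction with vision, not a hard rejection.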

Issue: Tables Extracted as Jumbled Text

Problem: Table cells read in wrong order.

Solution: Vision models maintain structure:

{ id: visionWorkflowId }  // Tables preserved as markdown

Issue: Missing Content

Problem: Some pages appear blank.

Solutions:

  1. Check if the PDF has a text layer:

# Use pdftotext to test
pdftotext document.pdf -

  2. Use vision for scanned pages:

{ id: visionWorkflowId }

Issue: Slow Processing

Problem: Documents take > 10 minutes.

Solutions:

  1. Use OCR for simple docs (10x faster)
  2. Process in batches with delays
  3. Use webhooks instead of polling
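For point 2, the fixed 1-second delay in processBatch above works, but retries after failures are politer with exponential backoff. A sketch with arbitrary defaults:

```typescript
// Exponential backoff: delay doubles with each attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Usage between retries: await sleep(backoffDelay(attempt))
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
```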

What's Next?

You now understand document processing. Next steps:

  1. Experiment with both extraction methods on your documents
  2. Monitor quality to optimize workflow selection
  3. Batch process document libraries efficiently

Happy processing! 📄
