
Document Processing: PDFs, Word, and Text Extraction

Master document processing with Graphlit. Deep dive into PDF extraction, OCR, table detection, Word documents, and multimodal extraction with vision models.

Document processing—extracting text from PDFs, Word files, and scanned images—is foundational for building knowledge bases. Graphlit offers two extraction strategies: traditional OCR (fast, cheap) and vision models (slower, higher quality). This guide helps you choose the right approach.

What You'll Learn

  • OCR vs vision model extraction
  • Handling complex PDFs (tables, multi-column, scanned)
  • Word, Excel, PowerPoint processing
  • Document metadata extraction
  • Quality optimization techniques
  • Production patterns for large document libraries

Part 1: The Two Extraction Strategies

Traditional OCR (Fast & Cheap)

import { Graphlit } from 'graphlit-client';
import { FilePreparationServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// OCR-based workflow
const ocrWorkflow = await graphlit.createWorkflow({
  name: "OCR Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document  // Traditional OCR
      }
    }]
  }
});

// Ingest with OCR
const content = await graphlit.ingestUri(
  'https://example.com/simple-document.pdf',
  'Simple PDF',
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflow.createWorkflow.id }
);

Good for:

  • Text-based PDFs
  • Simple layouts
  • Fast batch processing
  • Cost-sensitive applications

Not good for:

  • Scanned documents
  • Complex tables
  • Multi-column layouts
  • Handwritten text
  • Forms with checkboxes

Vision Models (High Quality)

// Vision-based workflow
const visionWorkflow = await graphlit.createWorkflow({
  name: "Vision Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // GPT-4 Vision
      }
    }]
  }
});

// Ingest with vision
const content = await graphlit.ingestUri(
  'https://example.com/complex-document.pdf',
  'Complex PDF',
  undefined,
  undefined,
  undefined,
  { id: visionWorkflow.createWorkflow.id }
);

Good for:

  • Scanned PDFs
  • Complex tables
  • Multi-column layouts
  • Forms and checkboxes
  • Handwritten text (to some degree)
  • Mixed text and images

Not good for:

  • Simple text PDFs (overkill)
  • Large-scale batch processing (expensive)
  • Real-time applications (slower)

Part 2: Document Types

PDF Documents

Text-based PDFs:

// Use OCR for text-based PDFs
const textPdf = await graphlit.ingestUri(
  'https://example.com/text-document.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflowId }
);

Scanned PDFs:

// Use vision for scanned PDFs
const scannedPdf = await graphlit.ingestUri(
  'https://example.com/scanned-document.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Check extracted text quality:

const content = await graphlit.getContent(contentId);

console.log('Pages extracted:', content.content.pages?.length);
console.log('First page text:', content.content.pages?.[0]?.chunks?.[0]?.text);

if (content.content.document) {
  console.log('Title:', content.content.document.title);
  console.log('Author:', content.content.document.author);
  console.log('Page count:', content.content.document.pageCount);
}

Word Documents

// Word docs use same workflow
const wordDoc = await graphlit.ingestUri(
  'https://example.com/document.docx',
  'Word Document',
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflowId }  // OCR works fine for Word
);

What gets extracted:

  • Text content
  • Headers and footers
  • Tables (converted to text)
  • Embedded images (optional)
  • Document metadata (author, created date, etc.)

Excel Spreadsheets

const excel = await graphlit.ingestUri(
  'https://example.com/data.xlsx',
  'Excel Sheet'
);

What gets extracted:

  • Cell values (converted to text)
  • Sheet names
  • Formulas (as text)
  • Table structures preserved as markdown

PowerPoint Presentations

const pptx = await graphlit.ingestUri(
  'https://example.com/presentation.pptx',
  'Presentation'
);

What gets extracted:

  • Slide text
  • Speaker notes
  • Slide order
  • Embedded images (optional)

Part 3: Handling Complex Documents

Tables

Problem: Tables often get mangled in extraction.

Solution: Use vision models:

const tableWorkflow = await graphlit.createWorkflow({
  name: "Table Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // Vision preserves tables
      }
    }]
  }
});

Result: Tables extracted as structured markdown:

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | $100k | $120k | $150k | $180k |
| Widget B | $80k | $90k | $95k | $110k |
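Because tables come back as markdown, downstream code can turn them into structured records. A hypothetical helper (not part of the SDK) that handles simple, well-formed tables like the one above:

```typescript
// Parse a simple markdown table (header row, separator row, data rows)
// into an array of objects keyed by column name.
function parseMarkdownTable(md: string): Record<string, string>[] {
  const lines = md.trim().split('\n').map(l => l.trim()).filter(Boolean);
  if (lines.length < 3) return [];  // need header, separator, and data

  // Split "| a | b |" into ['a', 'b'], dropping the empty edge cells
  const parseRow = (line: string) =>
    line.split('|').slice(1, -1).map(cell => cell.trim());

  const headers = parseRow(lines[0]);
  // lines[1] is the |---|---| separator; data starts at index 2
  return lines.slice(2).map(line => {
    const cells = parseRow(line);
    const row: Record<string, string> = {};
    headers.forEach((h, i) => { row[h] = cells[i] ?? ''; });
    return row;
  });
}
```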

Multi-Column Layouts

Problem: OCR reads left-to-right, breaking column flow.

Solution: Vision models understand layout:

// Vision maintains reading order across columns
const multiCol = await graphlit.ingestUri(
  'https://example.com/newspaper.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Scanned Documents

Problem: OCR fails on low-quality scans.

Solution: Vision models are more robust:

// Works with scanned, faxed, photocopied docs
const scanned = await graphlit.ingestUri(
  'https://example.com/scanned-contract.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Tip: For very poor quality, pre-process with image enhancement before ingestion.

Forms with Checkboxes

Problem: OCR doesn't detect checkbox states.

Solution: Vision models see visual elements:

// Extracts "☑ Agree to terms" vs "☐ Agree to terms"
const form = await graphlit.ingestUri(
  'https://example.com/application-form.pdf',
  undefined,
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

Part 4: Metadata Extraction

Document Metadata

const content = await graphlit.getContent(contentId);

if (content.content.document) {
  console.log('Title:', content.content.document.title);
  console.log('Author:', content.content.document.author);
  console.log('Subject:', content.content.document.subject);
  console.log('Keywords:', content.content.document.keywords);
  console.log('Created:', content.content.document.createdDate);
  console.log('Modified:', content.content.document.modifiedDate);
  console.log('Page count:', content.content.document.pageCount);
  console.log('Language:', content.content.document.language);
}
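Scanned or machine-generated PDFs often leave these fields null, so display code should tolerate gaps. A hypothetical formatter, assuming only the metadata field names shown above:

```typescript
// Build a short display string from document metadata, tolerating
// missing fields (scanned PDFs often carry no title or author).
interface DocumentSummary {
  title?: string | null;
  author?: string | null;
  pageCount?: number | null;
}

function describeDocument(doc: DocumentSummary, fallbackName: string): string {
  const parts = [doc.title || fallbackName];
  if (doc.author) parts.push(`by ${doc.author}`);
  if (doc.pageCount) parts.push(`(${doc.pageCount} pages)`);
  return parts.join(' ');
}
```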

Filter by Document Properties

import { ContentTypes, FileTypes } from 'graphlit-client/dist/generated/graphql-types';

// Find PDFs by author
const authorDocs = await graphlit.queryContents({
  search: 'machine learning',
  filter: {
    types: [ContentTypes.File],
    fileTypes: [FileTypes.Document],
    // Note: author filtering requires custom implementation
  }
});

Part 5: Quality Optimization

Compare Extraction Quality

// Test both methods on same document
const ocrResult = await graphlit.ingestUri(
  testUrl,
  'OCR Test',
  undefined,
  undefined,
  undefined,
  { id: ocrWorkflowId }
);

const visionResult = await graphlit.ingestUri(
  testUrl,
  'Vision Test',
  undefined,
  undefined,
  undefined,
  { id: visionWorkflowId }
);

// Wait for both
await Promise.all([
  waitForContent(ocrResult.ingestUri.id),
  waitForContent(visionResult.ingestUri.id)
]);

// Compare
const ocrContent = await graphlit.getContent(ocrResult.ingestUri.id);
const visionContent = await graphlit.getContent(visionResult.ingestUri.id);

console.log('OCR extracted:', ocrContent.content.pages?.length, 'pages');
console.log('Vision extracted:', visionContent.content.pages?.length, 'pages');
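The `waitForContent` helper used here isn't part of the SDK. One way to sketch it is a generic poller; the interval and timeout defaults below are arbitrary assumptions:

```typescript
// Generic poller: retry `check` until it resolves true or we time out.
async function pollUntil(
  check: () => Promise<boolean>,
  intervalMs = 2000,
  timeoutMs = 300000
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out after ${timeoutMs}ms`);
}
```

A `waitForContent(id)` built on this would call `graphlit.getContent(id)` inside `check` and return true once the content's state indicates processing has finished; the exact state field and values depend on the Graphlit schema.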

Hybrid Approach

// Use OCR first, fall back to vision if quality is poor
async function smartIngest(uri: string) {
  // Try OCR first
  const ocrContent = await graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: ocrWorkflowId });
  await waitForContent(ocrContent.ingestUri.id);
  
  const content = await graphlit.getContent(ocrContent.ingestUri.id);
  
  // Check quality (e.g., text length)
  const textLength = content.content.pages?.[0]?.chunks?.[0]?.text?.length || 0;
  
  if (textLength < 100) {
    // Quality seems poor, delete and retry with vision
    console.log('OCR quality poor, retrying with vision...');
    await graphlit.deleteContent(ocrContent.ingestUri.id);
    
    return graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: visionWorkflowId });
  }
  
  return ocrContent;
}

Part 6: Production Patterns

Batch Document Processing

async function processBatch(urls: string[]) {
  const results = [];
  
  for (const url of urls) {
    try {
      // Route by document type: PDFs may be scanned, so default them
      // to vision; other formats (Word, etc.) are text-based, so OCR is enough
      const isPdf = url.toLowerCase().endsWith('.pdf');
      const workflowId = isPdf ? visionWorkflowId : ocrWorkflowId;
      
      const content = await graphlit.ingestUri(url, undefined, undefined, undefined, undefined, { id: workflowId });
      
      results.push({
        url,
        contentId: content.ingestUri.id,
        status: 'ingested'
      });
      
      // Rate limit
      await new Promise(r => setTimeout(r, 1000));
    } catch (error: any) {
      results.push({
        url,
        status: 'failed',
        error: error.message
      });
    }
  }
  
  return results;
}

Document Quality Monitoring

async function monitorQuality(contentId: string) {
  const content = await graphlit.getContent(contentId);
  
  const quality = {
    pagesExtracted: content.content.pages?.length || 0,
    avgChunkLength: 0,
    hasMetadata: !!content.content.document,
    warnings: [] as string[]
  };
  
  // Calculate average chunk length
  if (content.content.pages) {
    const totalLength = content.content.pages.reduce((sum, page) => {
      return sum + (page.chunks?.reduce((chunkSum, chunk) => chunkSum + (chunk.text?.length || 0), 0) || 0);
    }, 0);
    const totalChunks = content.content.pages.reduce((sum, page) => sum + (page.chunks?.length || 0), 0);
    quality.avgChunkLength = totalChunks > 0 ? totalLength / totalChunks : 0;
  }
  
  // Add warnings
  if (quality.pagesExtracted === 0) {
    quality.warnings.push('No pages extracted');
  }
  if (quality.avgChunkLength < 100) {
    quality.warnings.push('Chunks suspiciously short - possible extraction issue');
  }
  
  return quality;
}

Common Issues & Solutions

Issue: Gibberish Text Extracted

Problem: Extracted text is garbled: "T∂e qu√ck br©wn f©x..."

Solution: Switch to vision model:

// OCR failed on this PDF encoding
// Use vision instead
{ id: visionWorkflowId }
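You can catch this automatically with a cheap heuristic: measure what share of the extracted text is ordinary printable ASCII. A sketch; the 0.9 threshold is an arbitrary assumption to tune against your own documents:

```typescript
// Heuristic: flag text whose share of plain printable characters is
// suspiciously low (garbled encodings show up as symbol soup).
function looksGarbled(text: string, threshold = 0.9): boolean {
  if (text.length === 0) return true;  // empty extraction is also a failure
  const plain = text.match(/[a-zA-Z0-9\s.,;:'"!?()\-]/g)?.length ?? 0;
  return plain / text.length < threshold;
}
```

Note this will false-positive on legitimately non-English or symbol-heavy documents, so treat it as a trigger for re-extraction with vision, not a hard rejection.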

Issue: Tables Extracted as Jumbled Text

Problem: Table cells read in wrong order.

Solution: Vision models maintain structure:

{ id: visionWorkflowId }  // Tables preserved as markdown

Issue: Missing Content

Problem: Some pages appear blank.

Solutions:

  1. Check if the PDF has a text layer:

# Use pdftotext to test
pdftotext document.pdf -

  2. Use vision for scanned pages:

{ id: visionWorkflowId }

Issue: Slow Processing

Problem: Documents take > 10 minutes.

Solutions:

  1. Use OCR for simple docs (10x faster)
  2. Process in batches with delays
  3. Use webhooks instead of polling
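For point 2, the fixed 1-second delay in processBatch above works, but retries after failures are politer with exponential backoff. A sketch with arbitrary defaults:

```typescript
// Exponential backoff: delay doubles with each attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Usage between retries: await sleep(backoffDelay(attempt))
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
```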

What's Next?

You now understand document processing. Next steps:

  1. Experiment with both extraction methods on your documents
  2. Monitor quality to optimize workflow selection
  3. Batch process document libraries efficiently

Happy processing! 📄
