Document processing—extracting text from PDFs, Word files, and scanned images—is foundational for building knowledge bases. Graphlit offers two extraction strategies: traditional OCR (fast, cheap) and vision models (slower, higher quality). This guide helps you choose the right approach.
What You'll Learn
- OCR vs vision model extraction
- Handling complex PDFs (tables, multi-column, scanned)
- Word, Excel, PowerPoint processing
- Document metadata extraction
- Quality optimization techniques
- Production patterns for large document libraries
Part 1: The Two Extraction Strategies
Traditional OCR (Fast & Cheap)
import { Graphlit } from 'graphlit-client';
import { FilePreparationServiceTypes } from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
// OCR-based workflow
const ocrWorkflow = await graphlit.createWorkflow({
name: "OCR Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.Document // Traditional OCR
}
}]
}
});
// Ingest with OCR
const content = await graphlit.ingestUri(
'https://example.com/simple-document.pdf',
'Simple PDF',
undefined,
undefined,
undefined,
{ id: ocrWorkflow.createWorkflow.id }
);
Good for:
- Text-based PDFs
- Simple layouts
- Fast batch processing
- Cost-sensitive applications
Not good for:
- Scanned documents
- Complex tables
- Multi-column layouts
- Handwritten text
- Forms with checkboxes
Vision Models (High Quality)
// Vision-based workflow
const visionWorkflow = await graphlit.createWorkflow({
name: "Vision Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument // GPT-4 Vision
}
}]
}
});
// Ingest with vision
const content = await graphlit.ingestUri(
'https://example.com/complex-document.pdf',
'Complex PDF',
undefined,
undefined,
undefined,
{ id: visionWorkflow.createWorkflow.id }
);
Good for:
- Scanned PDFs
- Complex tables
- Multi-column layouts
- Forms and checkboxes
- Handwritten text (to some degree)
- Mixed text and images
Not good for:
- Simple text PDFs (overkill)
- Large-scale batch processing (expensive)
- Real-time applications (slower)
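The trade-offs above can be encoded as a simple routing heuristic. This is a sketch, not a Graphlit API: the trait names (`scanned`, `complexTables`, etc.) are illustrative assumptions you would populate from whatever you know about your documents.

```typescript
// Illustrative heuristic: pick an extraction strategy from known document traits.
// These trait names are assumptions for the sketch, not Graphlit API fields.
interface DocumentTraits {
  scanned: boolean;
  complexTables: boolean;
  multiColumn: boolean;
  handwritten: boolean;
}

function chooseExtractionStrategy(traits: DocumentTraits): 'ocr' | 'vision' {
  // Any trait that OCR handles poorly pushes us to the vision model
  if (traits.scanned || traits.complexTables || traits.multiColumn || traits.handwritten) {
    return 'vision';
  }
  // Simple text-based documents: OCR is faster and cheaper
  return 'ocr';
}
```

You would then map the returned strategy to the corresponding workflow ID created earlier.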
Part 2: Document Types
PDF Documents
Text-based PDFs:
// Use OCR for text-based PDFs
const textPdf = await graphlit.ingestUri(
'https://example.com/text-document.pdf',
undefined,
undefined,
undefined,
undefined,
{ id: ocrWorkflowId }
);
Scanned PDFs:
// Use vision for scanned PDFs
const scannedPdf = await graphlit.ingestUri(
'https://example.com/scanned-document.pdf',
undefined,
undefined,
undefined,
undefined,
{ id: visionWorkflowId }
);
Check extracted text quality:
const content = await graphlit.getContent(contentId);
console.log('Pages extracted:', content.content.pages?.length);
console.log('First page text:', content.content.pages?.[0]?.chunks?.[0]?.text);
if (content.content.document) {
console.log('Title:', content.content.document.title);
console.log('Author:', content.content.document.author);
console.log('Page count:', content.content.document.pageCount);
}
Word Documents
// Word docs use same workflow
const wordDoc = await graphlit.ingestUri(
'https://example.com/document.docx',
'Word Document',
undefined,
undefined,
undefined,
{ id: ocrWorkflowId } // OCR works fine for Word
);
What gets extracted:
- Text content
- Headers and footers
- Tables (converted to text)
- Embedded images (optional)
- Document metadata (author, created date, etc.)
Excel Spreadsheets
const excel = await graphlit.ingestUri(
'https://example.com/data.xlsx',
'Excel Sheet'
);
What gets extracted:
- Cell values (converted to text)
- Sheet names
- Formulas (as text)
- Table structures preserved as markdown
PowerPoint Presentations
const pptx = await graphlit.ingestUri(
'https://example.com/presentation.pptx',
'Presentation'
);
What gets extracted:
- Slide text
- Speaker notes
- Slide order
- Embedded images (optional)
Part 3: Handling Complex Documents
Tables
Problem: Tables often get mangled in extraction.
Solution: Use vision models:
const tableWorkflow = await graphlit.createWorkflow({
name: "Table Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument // Vision preserves tables
}
}]
}
});
Result: Tables extracted as structured markdown:
| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | $100k | $120k | $150k | $180k |
| Widget B | $80k | $90k | $95k | $110k |
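If you want to consume extracted tables like this programmatically, a minimal parser sketch (assuming well-formed pipe tables with a `|---|` separator on the second line) could be:

```typescript
// Minimal sketch: parse a well-formed markdown pipe table into header + rows.
// Assumes the second line is the |---|---| separator row.
function parseMarkdownTable(markdown: string): { header: string[]; rows: string[][] } {
  const lines = markdown.trim().split('\n');
  const parseRow = (line: string) =>
    line.split('|').slice(1, -1).map(cell => cell.trim());
  const header = parseRow(lines[0]);
  // Skip the separator row at index 1
  const rows = lines.slice(2).map(parseRow);
  return { header, rows };
}
```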
Multi-Column Layouts
Problem: OCR reads straight across the page, so text from adjacent columns gets interleaved and the reading order is lost.
Solution: Vision models understand layout:
// Vision maintains reading order across columns
const multiCol = await graphlit.ingestUri(
'https://example.com/newspaper.pdf',
undefined,
undefined,
undefined,
undefined,
{ id: visionWorkflowId }
);
Scanned Documents
Problem: OCR fails on low-quality scans.
Solution: Vision models are more robust:
// Works with scanned, faxed, photocopied docs
const scanned = await graphlit.ingestUri(
'https://example.com/scanned-contract.pdf',
undefined,
undefined,
undefined,
undefined,
{ id: visionWorkflowId }
);
Tip: For very poor quality, pre-process with image enhancement before ingestion.
Forms with Checkboxes
Problem: OCR doesn't detect checkbox states.
Solution: Vision models see visual elements:
// Extracts "☑ Agree to terms" vs "☐ Agree to terms"
const form = await graphlit.ingestUri(
'https://example.com/application-form.pdf',
undefined,
undefined,
undefined,
undefined,
{ id: visionWorkflowId }
);
Part 4: Metadata Extraction
Document Metadata
const content = await graphlit.getContent(contentId);
if (content.content.document) {
console.log('Title:', content.content.document.title);
console.log('Author:', content.content.document.author);
console.log('Subject:', content.content.document.subject);
console.log('Keywords:', content.content.document.keywords);
console.log('Created:', content.content.document.createdDate);
console.log('Modified:', content.content.document.modifiedDate);
console.log('Page count:', content.content.document.pageCount);
console.log('Language:', content.content.document.language);
}
Filter by Document Properties
import { ContentTypes, FileTypes } from 'graphlit-client/dist/generated/graphql-types';
// Find PDFs by author
const authorDocs = await graphlit.queryContents({
search: 'machine learning',
filter: {
types: [ContentTypes.File],
fileTypes: [FileTypes.Document],
// Note: author filtering requires custom implementation
}
});
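Since author filtering needs a custom implementation, one approach is to over-fetch and post-filter client-side. The `DocResult` shape below is a simplified assumption that mirrors the metadata fields shown in this guide, not the full Graphlit response type:

```typescript
// Client-side post-filter: keep only results whose extracted document
// metadata lists the given author (case-insensitive substring match).
// The shape here is a simplified assumption, not the full API type.
interface DocResult {
  id: string;
  document?: { author?: string };
}

function filterByAuthor(results: DocResult[], author: string): DocResult[] {
  const needle = author.toLowerCase();
  return results.filter(r =>
    (r.document?.author || '').toLowerCase().includes(needle)
  );
}
```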
Part 5: Quality Optimization
Compare Extraction Quality
// Test both methods on same document
const ocrResult = await graphlit.ingestUri(
testUrl,
'OCR Test',
undefined,
undefined,
undefined,
{ id: ocrWorkflowId }
);
const visionResult = await graphlit.ingestUri(
testUrl,
'Vision Test',
undefined,
undefined,
undefined,
{ id: visionWorkflowId }
);
// Wait for both
await Promise.all([
waitForContent(ocrResult.ingestUri.id),
waitForContent(visionResult.ingestUri.id)
]);
// Compare
const ocrContent = await graphlit.getContent(ocrResult.ingestUri.id);
const visionContent = await graphlit.getContent(visionResult.ingestUri.id);
console.log('OCR extracted:', ocrContent.content.pages?.length, 'pages');
console.log('Vision extracted:', visionContent.content.pages?.length, 'pages');
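The comparison above relies on a `waitForContent` helper that this guide doesn't define. A hedged sketch is a generic poller; how you detect completion depends on your client (checking a content `state` field is an assumption to verify against your SDK version):

```typescript
// Generic polling helper: repeatedly invoke check() until it returns a value
// or the timeout elapses. A sketch of what waitForContent might look like.
async function pollUntil<T>(
  check: () => Promise<T | undefined>,
  intervalMs = 2000,
  timeoutMs = 300000
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await check();
    if (result !== undefined) return result;
    // Not done yet: wait before polling again
    await new Promise(r => setTimeout(r, intervalMs));
  }
  throw new Error('Timed out waiting for content to finish processing');
}
```

Usage would look something like `pollUntil(async () => { const c = await graphlit.getContent(id); return isFinished(c) ? c : undefined; })`, where `isFinished` is whatever completion check your client exposes.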
Hybrid Approach
// Use OCR first, fall back to vision if quality is poor
async function smartIngest(uri: string) {
// Try OCR first
const ocrContent = await graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: ocrWorkflowId });
await waitForContent(ocrContent.ingestUri.id);
const content = await graphlit.getContent(ocrContent.ingestUri.id);
// Check quality (e.g., text length)
const textLength = content.content.pages?.[0]?.chunks?.[0]?.text?.length || 0;
if (textLength < 100) {
// Quality seems poor, delete and retry with vision
console.log('OCR quality poor, retrying with vision...');
await graphlit.deleteContent(ocrContent.ingestUri.id);
return graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: visionWorkflowId });
}
return ocrContent;
}
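The length check in `smartIngest` is a blunt signal. A slightly richer (still illustrative) heuristic also flags gibberish output like the garbled example later in this guide; the thresholds are assumptions to tune on your own corpus:

```typescript
// Illustrative quality heuristic: flag an extraction as poor when there is
// little text overall, or when the ratio of alphanumeric characters is low
// (a common symptom of garbled OCR output). Thresholds are assumptions.
function looksPoorlyExtracted(text: string): boolean {
  if (text.trim().length < 100) return true;
  const alphanumeric = (text.match(/[a-zA-Z0-9]/g) || []).length;
  return alphanumeric / text.length < 0.5;
}
```

In `smartIngest`, you could swap the `textLength < 100` test for `looksPoorlyExtracted(text)` to also catch garbled-but-long extractions.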
Part 6: Production Patterns
Batch Document Processing
async function processBatch(urls: string[]) {
const results = [];
for (const url of urls) {
try {
// Classify by extension: PDFs may be scanned, so route them to vision;
// Word and other office docs extract fine with OCR
const isPdf = url.endsWith('.pdf');
const workflowId = isPdf ? visionWorkflowId : ocrWorkflowId;
const content = await graphlit.ingestUri(url, undefined, undefined, undefined, undefined, { id: workflowId });
results.push({
url,
contentId: content.ingestUri.id,
status: 'ingested'
});
// Rate limit
await new Promise(r => setTimeout(r, 1000));
} catch (error: any) {
results.push({
url,
status: 'failed',
error: error.message
});
}
}
return results;
}
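The sequential loop above is simple but slow for large libraries. If your rate limits allow a few requests in flight, a bounded-concurrency sketch like this (a generic utility, not a Graphlit API) keeps throughput up without flooding the service:

```typescript
// Sketch of bounded-concurrency processing: run at most `limit` tasks at once
// instead of a strictly sequential loop. Result order matches input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed index until none remain
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

You could wrap the per-URL try/catch body from `processBatch` in the `fn` callback and call `mapWithConcurrency(urls, 3, fn)` to process three documents at a time.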
Document Quality Monitoring
async function monitorQuality(contentId: string) {
const content = await graphlit.getContent(contentId);
const quality = {
pagesExtracted: content.content.pages?.length || 0,
avgChunkLength: 0,
hasMetadata: !!content.content.document,
warnings: [] as string[]
};
// Calculate average chunk length
if (content.content.pages) {
const totalLength = content.content.pages.reduce((sum, page) => {
return sum + (page.chunks?.reduce((chunkSum, chunk) => chunkSum + (chunk.text?.length || 0), 0) || 0);
}, 0);
const totalChunks = content.content.pages.reduce((sum, page) => sum + (page.chunks?.length || 0), 0);
quality.avgChunkLength = totalChunks > 0 ? totalLength / totalChunks : 0;
}
// Add warnings
if (quality.pagesExtracted === 0) {
quality.warnings.push('No pages extracted');
}
if (quality.avgChunkLength < 100) {
quality.warnings.push('Chunks suspiciously short - possible extraction issue');
}
return quality;
}
Common Issues & Solutions
Issue: Gibberish Text Extracted
Problem: Extracted text is garbled: "T∂e qu√ck br©wn f©x..."
Solution: Switch to vision model:
// OCR failed on this PDF encoding
// Use vision instead
{ id: visionWorkflowId }
Issue: Tables Extracted as Jumbled Text
Problem: Table cells read in wrong order.
Solution: Vision models maintain structure:
{ id: visionWorkflowId } // Tables preserved as markdown
Issue: Missing Content
Problem: Some pages appear blank.
Solutions:
- Check if PDF has text layer:
# Use pdftotext to test
pdftotext document.pdf -
- Use vision for scanned pages:
{ id: visionWorkflowId }
Issue: Slow Processing
Problem: Documents take > 10 minutes.
Solutions:
- Use OCR for simple docs (10x faster)
- Process in batches with delays
- Use webhooks instead of polling
What's Next?
You now understand document processing. Next steps:
- Experiment with both extraction methods on your documents
- Monitor quality to optimize workflow selection
- Batch process document libraries efficiently
Related guides:
- Content Ingestion - Ingestion basics
- Workflows and Processing - Customize extraction
- Building Knowledge Graphs - Extract entities from documents
Happy processing! 📄