Everything in Graphlit starts with ingestion—bringing content from the outside world into your knowledge base. Whether you're uploading PDFs, scraping web pages, or ingesting text snippets, understanding how ingestion works is critical for building production applications.
This guide covers the three ingestion methods, content vs file types, metadata extraction, lifecycle states, polling strategies, and batch operations. By the end, you'll know exactly how to get any content type into Graphlit efficiently.
What You'll Learn
- Three ingestion methods: URI, file upload, and text
- Content types vs file types (EMAIL, MESSAGE, FILE, PAGE, etc.)
- Metadata extraction by content type
- Content lifecycle states (CREATED → PROCESSING → INDEXED)
- Polling vs webhooks for completion tracking
- Batch ingestion patterns
- Error handling and retry strategies
Prerequisites: A Graphlit project, SDK installed.
Developer Note: All Graphlit IDs are GUIDs. Example outputs show realistic GUID format.
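Because every ID is a GUID, a quick format check can catch malformed IDs before they reach an API call. This is a minimal sketch, not part of the Graphlit SDK; `isGuid` is a hypothetical helper name:

```typescript
// Hypothetical helper: check that a string matches the
// 8-4-4-4-12 hex-digit GUID format used for Graphlit IDs.
function isGuid(id: string): boolean {
  return /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(id);
}

console.log(isGuid('4f2d9a61-1b3c-4e8f-9a2b-7c5d3e1f0a9b')); // true
console.log(isGuid('not-a-guid')); // false
```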
Part 1: The Three Ingestion Methods
Method 1: Ingest from URI (Most Common)
Ingest content from any publicly accessible URL:
import { Graphlit } from 'graphlit-client';
const graphlit = new Graphlit();
// Ingest a PDF from URL
const pdf = await graphlit.ingestUri(
'https://example.com/document.pdf',
'Annual Report 2024' // Optional name
);
console.log('Content ID:', pdf.ingestUri.id);
console.log('State:', pdf.ingestUri.state); // CREATED
// Ingest a web page
const page = await graphlit.ingestUri(
'https://example.com/blog/ai-trends',
'AI Trends Article'
);
// Ingest audio/video
const video = await graphlit.ingestUri(
'https://example.com/webinar.mp4',
'Product Webinar'
);
Supports:
- Documents (PDF, Word, Excel, PowerPoint)
- Web pages (HTML)
- Images (JPEG, PNG, GIF, etc.)
- Audio (MP3, WAV, podcast feeds)
- Video (MP4, MOV, etc.)
- Archives (ZIP containing files)
Use cases:
- Scraping documentation sites
- Importing research papers from URLs
- Processing podcast RSS feeds
- Ingesting social media links
Method 2: Upload File (Direct Upload)
Upload files directly from your app or local filesystem:
import fs from 'fs';
// Read file to base64
const fileBuffer = fs.readFileSync('/path/to/document.pdf');
const fileData = fileBuffer.toString('base64');
// Upload file
const upload = await graphlit.ingestEncodedFile(
'document.pdf',
fileData,
'application/pdf',
'Uploaded Document'
);
console.log('Uploaded:', upload.ingestEncodedFile.id);
Browser example (React):
async function handleFileUpload(file: File) {
// Convert file to base64 (chunked, since spreading a large
// Uint8Array into String.fromCharCode can overflow the call stack)
const bytes = new Uint8Array(await file.arrayBuffer());
let binary = '';
const chunkSize = 0x8000;
for (let i = 0; i < bytes.length; i += chunkSize) {
binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
}
const base64 = btoa(binary);
// Upload to Graphlit
const result = await graphlit.ingestEncodedFile(
file.name,
base64,
file.type,
file.name
);
return result.ingestEncodedFile.id;
}
// In your component
<input
type="file"
onChange={async (e) => {
const file = e.target.files?.[0];
if (file) {
const contentId = await handleFileUpload(file);
console.log('Uploaded:', contentId);
}
}}
/>
Use cases:
- User file uploads
- Document management systems
- Local file processing
Method 3: Ingest Text (Plain Text/Markdown)
Ingest raw text or markdown without a file:
// Ingest plain text
const text = await graphlit.ingestText(
'This is my note about quantum computing...',
'Quantum Computing Notes',
false // isMarkdown
);
// Ingest markdown
const markdown = await graphlit.ingestText(
'# Project Plan\n\n## Phase 1\n- Task 1\n- Task 2',
'Project Plan',
true // isMarkdown
);
console.log('Text content ID:', text.ingestText.id);
Use cases:
- Note-taking apps
- User-generated content
- Chat messages
- Code snippets
Part 2: Content Types vs File Types
Understanding the difference is critical for filtering and metadata access.
Content Types (Primary Classification)
What it represents (semantic meaning):
import { ContentTypes } from 'graphlit-client/dist/generated/graphql-types';
// Content types:
ContentTypes.Email // Email messages
ContentTypes.Message // Slack, Teams, Discord messages
ContentTypes.Page // Web pages
ContentTypes.File // Physical files (PDFs, images, videos, etc.)
ContentTypes.Post // Social media posts, RSS
ContentTypes.Event // Calendar events
ContentTypes.Issue // Jira, Linear, GitHub issues
ContentTypes.Text // Plain text, markdown
File Types (Secondary Classification)
Physical format (only when contentType = FILE):
import { FileTypes } from 'graphlit-client/dist/generated/graphql-types';
// File types:
FileTypes.Document // PDF, DOCX, XLSX, PPTX
FileTypes.Image // JPEG, PNG, GIF, TIFF
FileTypes.Audio // MP3, WAV, M4A
FileTypes.Video // MP4, MOV, AVI
FileTypes.Code // Python, JavaScript, etc.
FileTypes.Data // JSON, XML, CSV
FileTypes.Package // ZIP, TAR
The Hierarchy
Content
└─ type: ContentTypes (always present)
└─ fileType: FileTypes (only if type = FILE)
Examples:
// PDF document
content.type = ContentTypes.File
content.fileType = FileTypes.Document
// Email with PDF attachment
content.type = ContentTypes.Email
content.fileType = null // Emails don't have fileType
// Slack message with image
content.type = ContentTypes.Message
content.fileType = FileTypes.Image // The image attached
// Web page
content.type = ContentTypes.Page
content.fileType = null
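The hierarchy above can be captured in a small helper. This is a self-contained sketch using plain string values (`'FILE'`, `'EMAIL'`, etc.) rather than the SDK enums, and `describeContent` is a hypothetical name, not an SDK function:

```typescript
// Sketch of the type/fileType hierarchy using plain strings.
// fileType is only meaningful when the content is (or carries) a file.
type ContentType = 'EMAIL' | 'MESSAGE' | 'PAGE' | 'FILE' | 'POST' | 'EVENT' | 'ISSUE' | 'TEXT';
type FileType = 'DOCUMENT' | 'IMAGE' | 'AUDIO' | 'VIDEO' | 'CODE' | 'DATA' | 'PACKAGE';

function describeContent(type: ContentType, fileType: FileType | null): string {
  if (type === 'FILE' && fileType) return `file (${fileType.toLowerCase()})`;
  return type.toLowerCase();
}

console.log(describeContent('FILE', 'DOCUMENT')); // "file (document)"
console.log(describeContent('EMAIL', null)); // "email"
```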
Filtering by Type
// Get all emails
const emails = await graphlit.queryContents({
filter: {
types: [ContentTypes.Email]
}
});
// Get all PDF documents
const pdfs = await graphlit.queryContents({
filter: {
types: [ContentTypes.File],
fileTypes: [FileTypes.Document]
}
});
// Get all images (from any source)
const images = await graphlit.queryContents({
filter: {
fileTypes: [FileTypes.Image]
}
});
Type-Specific Metadata
Each content type has a metadata field:
const content = await graphlit.getContent(contentId);
switch (content.content.type) {
case ContentTypes.Email:
// Access email metadata
console.log('From:', content.content.email.from[0].email);
console.log('Subject:', content.content.email.subject);
console.log('Sent:', content.content.email.sentDateTime);
break;
case ContentTypes.Message:
// Access message metadata
console.log('Channel:', content.content.message.channelName);
console.log('Author:', content.content.message.author);
console.log('Platform:', content.content.message.platform); // SLACK, TEAMS, etc.
break;
case ContentTypes.File:
if (content.content.fileType === FileTypes.Document) {
// Access document metadata
console.log('Pages:', content.content.document.pageCount);
console.log('Author:', content.content.document.author);
console.log('Title:', content.content.document.title);
}
break;
case ContentTypes.Event:
// Access calendar event metadata
console.log('Start:', content.content.event.startDateTime);
console.log('Attendees:', content.content.event.attendees.length);
break;
}
Part 3: Content Lifecycle States
Content goes through states during processing:
The State Machine
CREATED → PROCESSING → INDEXED
              ↓ (on error)
            FAILED
States:
- CREATED: Just ingested, waiting to process
- PROCESSING: Extracting text, generating embeddings, running workflows
- INDEXED: Fully processed and searchable
- FAILED: Processing failed (check the error message)
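The state machine is small enough to encode directly. The transition table below reflects the diagram above; this is an illustrative sketch, not an SDK API:

```typescript
// Allowed lifecycle transitions, per the state machine above.
const transitions: Record<string, string[]> = {
  CREATED: ['PROCESSING'],
  PROCESSING: ['INDEXED', 'FAILED'],
  INDEXED: [],
  FAILED: []
};

function canTransition(from: string, to: string): boolean {
  return (transitions[from] ?? []).includes(to);
}

console.log(canTransition('CREATED', 'PROCESSING')); // true
console.log(canTransition('INDEXED', 'CREATED')); // false (terminal state)
```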
Checking State
const content = await graphlit.getContent(contentId);
console.log('State:', content.content.state);
// States from enum
import { EntityState } from 'graphlit-client/dist/generated/graphql-types';
if (content.content.state === EntityState.Indexed) {
console.log('✓ Content is ready for search');
}
Polling for Completion
// Method 1: Simple polling
async function waitForContent(contentId: string) {
let isDone = false;
while (!isDone) {
const status = await graphlit.isContentDone(contentId);
isDone = status.isContentDone.result;
if (!isDone) {
console.log('Still processing...');
await new Promise(resolve => setTimeout(resolve, 2000)); // 2s delay
}
}
console.log('✓ Content ready!');
}
// Usage
const content = await graphlit.ingestUri('https://example.com/doc.pdf');
await waitForContent(content.ingestUri.id);
With timeout:
async function waitForContentWithTimeout(contentId: string, timeoutMs = 60000) {
const startTime = Date.now();
while (Date.now() - startTime < timeoutMs) {
const status = await graphlit.isContentDone(contentId);
if (status.isContentDone.result) {
return true; // Success
}
await new Promise(resolve => setTimeout(resolve, 2000));
}
throw new Error('Timeout waiting for content');
}
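The two polling loops above share a common shape worth factoring out: a generic poller that takes any async check function. This is a sketch, not an SDK utility; in real use the check would wrap `graphlit.isContentDone`:

```typescript
// Generic poller: repeatedly invoke `check` until it returns true,
// waiting `intervalMs` between attempts, failing after `timeoutMs`.
async function pollUntil(
  check: () => Promise<boolean>,
  intervalMs = 2000,
  timeoutMs = 60000
): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    if (await check()) return;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timeout while polling');
}

// Demo with a fake check that succeeds on the third attempt;
// in production: pollUntil(async () => (await graphlit.isContentDone(id)).isContentDone.result)
(async () => {
  let attempts = 0;
  await pollUntil(async () => ++attempts >= 3, 10, 1000);
  console.log(`Done after ${attempts} attempts`); // Done after 3 attempts
})();
```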
Webhooks (Production Pattern)
Don't poll in production—use webhooks:
// When creating a feed or workflow, specify webhook URL
const feed = await graphlit.createFeed(
FeedServiceTypes.Rss,
workflowId,
{ uri: 'https://example.com/rss' },
'My Feed',
undefined,
'https://yourapp.com/webhooks/graphlit' // Your webhook endpoint
);
// Your webhook handler (Express.js)
app.post('/webhooks/graphlit', async (req, res) => {
const event = req.body;
if (event.type === 'content.done') {
console.log('Content ready:', event.contentId);
// Fetch full content
const content = await graphlit.getContent(event.contentId);
// Process it (index in your DB, trigger notifications, etc.)
await processContent(content);
}
res.sendStatus(200);
});
Part 4: Batch Operations
Batch Ingestion
// Ingest multiple URLs
const urls = [
'https://example.com/doc1.pdf',
'https://example.com/doc2.pdf',
'https://example.com/doc3.pdf'
];
const contentIds: string[] = [];
for (const url of urls) {
const result = await graphlit.ingestUri(url);
contentIds.push(result.ingestUri.id);
console.log(`Ingested: ${url}`);
}
// Wait for all to complete
await Promise.all(
contentIds.map(id => waitForContent(id))
);
console.log('✓ All content indexed');
Parallel ingestion:
// Ingest in parallel (faster)
const results = await Promise.all(
urls.map(url => graphlit.ingestUri(url))
);
const contentIds = results.map(r => r.ingestUri.id);
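Unbounded `Promise.all` can overwhelm rate limits on large batches. A concurrency-limited mapper keeps at most N ingestions in flight; this is a generic sketch, and the limit of 5 in the usage comment is an arbitrary assumption, not a documented Graphlit quota:

```typescript
// Map over items with at most `limit` promises in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // `next++` is safe: JS is single-threaded between awaits
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

// Usage sketch: at most 5 concurrent ingestions
// const contentIds = await mapWithConcurrency(urls, 5, async url =>
//   (await graphlit.ingestUri(url)).ingestUri.id
// );
```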
Batch Deletion
import { EntityState } from 'graphlit-client/dist/generated/graphql-types';
// Get all content
const allContent = await graphlit.queryContents({
filter: {
states: [EntityState.Indexed, EntityState.Failed]
}
});
// Delete in batches
const contentIds = allContent.contents.results.map(c => c.id);
const deleted = await graphlit.deleteContents(contentIds);
console.log(`Deleted ${deleted.deleteContents.count} content items`);
Delete with filter:
// Delete all PDFs created in the last 30 days
const lastMonthPdfs = await graphlit.queryContents({
filter: {
types: [ContentTypes.File],
fileTypes: [FileTypes.Document],
creationDateRange: {
from: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString(),
to: new Date().toISOString()
}
}
});
const ids = lastMonthPdfs.contents.results.map(c => c.id);
await graphlit.deleteContents(ids);
Part 5: Advanced Patterns
Pattern 1: Ingest with Workflow
Apply processing workflows during ingestion:
// Create workflow first (see Workflows guide)
const workflow = await graphlit.createWorkflow({
name: "Extract Entities",
preparation: { /* ... */ },
extraction: { /* ... */ }
});
// Ingest with workflow
const content = await graphlit.ingestUri(
'https://example.com/doc.pdf',
'Document',
undefined,
undefined,
undefined,
{ id: workflow.createWorkflow.id } // Apply workflow
);
// Content will be processed with entity extraction
Pattern 2: Update Content Metadata
// Update content name or metadata
await graphlit.updateContent(contentId, {
name: 'Updated Name'
});
// Query to verify
const updated = await graphlit.getContent(contentId);
console.log('New name:', updated.content.name);
Pattern 3: Re-Ingest Content
// To re-process content with a different workflow:
// 1. Delete old content
await graphlit.deleteContent(oldContentId);
// 2. Re-ingest with new workflow
const newContent = await graphlit.ingestUri(
sameUrl,
'Same Document',
undefined,
undefined,
undefined,
{ id: newWorkflowId }
);
Pattern 4: Content Collections
Organize content into collections:
// Create collection
const collection = await graphlit.createCollection('Research Papers');
const collectionId = collection.createCollection.id;
// Ingest into collection
const content = await graphlit.ingestUri(
'https://example.com/paper.pdf',
'Research Paper',
undefined,
undefined,
[{ id: collectionId }] // Add to collection
);
// Query collection
const papers = await graphlit.queryContents({
filter: {
collections: [{ id: collectionId }]
}
});
Part 6: Error Handling
Handling Failed Ingestion
let contentId: string | undefined;
try {
const content = await graphlit.ingestUri('https://example.com/doc.pdf');
contentId = content.ingestUri.id;
await waitForContent(contentId);
} catch (error: any) {
console.error('Ingestion failed:', error.message);
// Check whether the content ended up in the FAILED state
if (contentId) {
const failed = await graphlit.getContent(contentId);
if (failed.content.state === EntityState.Failed) {
console.log('Error details:', failed.content.error);
// Retry with different settings or workflow
const retry = await graphlit.ingestUri(
'https://example.com/doc.pdf',
'Retry',
undefined,
undefined,
undefined,
{ id: alternativeWorkflowId }
);
}
}
}
Retry with Exponential Backoff
async function ingestWithRetry(uri: string, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const content = await graphlit.ingestUri(uri);
await waitForContent(content.ingestUri.id);
return content.ingestUri.id;
} catch (error) {
if (attempt === maxRetries) throw error;
const delay = Math.pow(2, attempt) * 1000;
console.log(`Attempt ${attempt} failed, retrying in ${delay}ms...`);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
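The delay computation above can be pulled out and given jitter to avoid many clients retrying in lockstep. This is a sketch; the full-jitter strategy is a common pattern, not something the SDK prescribes:

```typescript
// Exponential backoff delay for attempt n (1-based): base * 2^n ms.
// With `jitter`, returns a random delay in [0, base * 2^n) instead,
// so simultaneous retries spread out rather than stampeding together.
function backoffDelay(attempt: number, baseMs = 1000, jitter = false): number {
  const delay = Math.pow(2, attempt) * baseMs;
  return jitter ? Math.random() * delay : delay;
}

console.log(backoffDelay(1)); // 2000
console.log(backoffDelay(3)); // 8000
```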
Common Issues & Solutions
Issue: Content Stuck in PROCESSING
Problem: Content never reaches INDEXED state.
Solutions:
- Check for timeout:
const content = await graphlit.getContent(contentId);
if (content.content.state === EntityState.Processing) {
console.log('Still processing after 5 minutes - may be stuck');
// Contact support or delete and re-ingest
}
- Try simpler workflow:
// Ingest without workflow
const simple = await graphlit.ingestUri(uri);
Issue: Large Files Timeout
Problem: Uploading large files (>100MB) fails.
Solution: Use URI ingestion with pre-signed URLs:
// 1. Upload file to S3/Azure with your own bucket
const s3Url = await uploadToS3(file);
// 2. Ingest from S3
const content = await graphlit.ingestUri(s3Url);
Issue: Extracted Text is Gibberish
Problem: PDF text extraction produces nonsense.
Solution: Use vision-based extraction:
import { FilePreparationServiceTypes } from 'graphlit-client/dist/generated/graphql-types';
// Create workflow with vision model
const workflow = await graphlit.createWorkflow({
name: "Vision Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument // Uses GPT-4 Vision
}
}]
}
});
// Ingest with vision workflow
const content = await graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: workflow.createWorkflow.id });
What's Next?
You now understand content ingestion completely. Next steps:
- Set up data connectors to automatically ingest from Slack, Gmail, etc.
- Create workflows to customize processing
- Use collections to organize content
- Implement webhooks for production monitoring
Related guides:
- Data Connectors Guide - Auto-sync from 25+ sources
- Workflows and Processing - Customize extraction
- Building Knowledge Graphs - Extract entities
Complete Example: Production Ingestion Pipeline
import { Graphlit } from 'graphlit-client';
import { ContentTypes, FileTypes, EntityState } from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
interface IngestionResult {
contentId: string;
url: string;
state: string;
duration: number;
error?: string;
}
async function productionIngest(urls: string[], workflowId?: string): Promise<IngestionResult[]> {
const results: IngestionResult[] = [];
// Ingest all in parallel
const ingestions = await Promise.allSettled(
urls.map(url => graphlit.ingestUri(
url,
undefined,
undefined,
undefined,
undefined,
workflowId ? { id: workflowId } : undefined
))
);
// Track successful ingestions
const contentIds: Array<{ url: string; contentId: string; startTime: number }> = [];
ingestions.forEach((result, i) => {
if (result.status === 'fulfilled') {
contentIds.push({
url: urls[i],
contentId: result.value.ingestUri.id,
startTime: Date.now()
});
} else {
results.push({
contentId: '',
url: urls[i],
state: 'FAILED',
duration: 0,
error: result.reason.message
});
}
});
// Wait for all to complete (with timeout)
await Promise.all(
contentIds.map(async ({ url, contentId, startTime }) => {
try {
await waitForContentWithTimeout(contentId, 300000); // 5min timeout
const content = await graphlit.getContent(contentId);
results.push({
contentId,
url,
state: content.content.state,
duration: Date.now() - startTime
});
} catch (error: any) {
results.push({
contentId,
url,
state: 'TIMEOUT',
duration: Date.now() - startTime,
error: error.message
});
}
})
);
return results;
}
// Usage
const urls = [
'https://example.com/doc1.pdf',
'https://example.com/doc2.pdf',
'https://example.com/doc3.pdf'
];
const results = await productionIngest(urls, workflowId);
console.log('Ingestion Results:');
results.forEach(r => {
console.log(`${r.url}: ${r.state} (${r.duration}ms)${r.error ? ` - ${r.error}` : ''}`);
});
const successful = results.filter(r => r.state === EntityState.Indexed).length;
console.log(`\n${successful}/${urls.length} successful`);
Happy ingesting! 📥