
Content Ingestion: Files, Text, and URLs

Complete guide to ingesting content into Graphlit. Learn about content types, file types, metadata, lifecycle states, and batch operations.

Everything in Graphlit starts with ingestion—bringing content from the outside world into your knowledge base. Whether you're uploading PDFs, scraping web pages, or ingesting text snippets, understanding how ingestion works is critical for building production applications.

This guide covers the three ingestion methods, content vs file types, metadata extraction, lifecycle states, polling strategies, and batch operations. By the end, you'll know exactly how to get any content type into Graphlit efficiently.

What You'll Learn

  • Three ingestion methods: URI, file upload, and text
  • Content types vs file types (EMAIL, MESSAGE, FILE, PAGE, etc.)
  • Metadata extraction by content type
  • Content lifecycle states (CREATED → PROCESSING → INDEXED)
  • Polling vs webhooks for completion tracking
  • Batch ingestion patterns
  • Error handling and retry strategies

Prerequisites: A Graphlit project, SDK installed.

Developer Note: All Graphlit IDs are GUIDs. Example outputs show realistic GUID format.


Part 1: The Three Ingestion Methods

Method 1: Ingest from URI (Most Common)

Ingest content from any publicly accessible URL:

import { Graphlit } from 'graphlit-client';

const graphlit = new Graphlit();

// Ingest a PDF from URL
const pdf = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  'Annual Report 2024'  // Optional name
);

console.log('Content ID:', pdf.ingestUri.id);
console.log('State:', pdf.ingestUri.state);  // CREATED

// Ingest a web page
const page = await graphlit.ingestUri(
  'https://example.com/blog/ai-trends',
  'AI Trends Article'
);

// Ingest audio/video
const video = await graphlit.ingestUri(
  'https://example.com/webinar.mp4',
  'Product Webinar'
);

Supports:

  • Documents (PDF, Word, Excel, PowerPoint)
  • Web pages (HTML)
  • Images (JPEG, PNG, GIF, etc.)
  • Audio (MP3, WAV, podcast feeds)
  • Video (MP4, MOV, etc.)
  • Archives (ZIP containing files)

Use cases:

  • Scraping documentation sites
  • Importing research papers from URLs
  • Processing podcast RSS feeds
  • Ingesting social media links
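Before ingesting a batch of mixed URLs, it can help to predict which of these categories each one will land in. A minimal sketch, with the caveat that the helper and its extension mapping are illustrative and not part of the SDK:

```typescript
// Hypothetical helper: guess a content category from a URL's file extension.
// The mapping is illustrative and deliberately incomplete.
const EXTENSION_CATEGORIES: Record<string, string> = {
  pdf: 'Document', docx: 'Document', xlsx: 'Document', pptx: 'Document',
  html: 'Page',
  jpg: 'Image', jpeg: 'Image', png: 'Image', gif: 'Image',
  mp3: 'Audio', wav: 'Audio',
  mp4: 'Video', mov: 'Video',
  zip: 'Package',
};

function guessCategory(url: string): string {
  // Strip query string and fragment, then take the path's last dot-suffix
  const path = new URL(url).pathname;
  const ext = path.split('.').pop()?.toLowerCase() ?? '';
  return EXTENSION_CATEGORIES[ext] ?? 'Unknown';
}
```

This is only a hint for logging or routing; Graphlit classifies content itself during processing.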

Method 2: Upload File (Direct Upload)

Upload files directly from your app or local filesystem:

import fs from 'fs';

// Read file to base64
const fileBuffer = fs.readFileSync('/path/to/document.pdf');
const fileData = fileBuffer.toString('base64');

// Upload file
const upload = await graphlit.ingestEncodedFile(
  'document.pdf',
  fileData,
  'application/pdf',
  'Uploaded Document'
);

console.log('Uploaded:', upload.ingestEncodedFile.id);
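ingestEncodedFile needs a MIME type, which you may not always have on hand for local files. A small lookup sketch (the helper name and mapping are illustrative; a library like `mime-types` covers far more formats):

```typescript
// Hypothetical helper: derive a MIME type from a filename for ingestEncodedFile.
// Unknown extensions fall back to the generic binary type.
const MIME_TYPES: Record<string, string> = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  png: 'image/png',
  jpg: 'image/jpeg',
  mp3: 'audio/mpeg',
  mp4: 'video/mp4',
};

function mimeTypeFor(filename: string): string {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  return MIME_TYPES[ext] ?? 'application/octet-stream';
}
```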

Browser example (React):

async function handleFileUpload(file: File) {
  // Convert file to base64 in chunks. Spreading the whole array into
  // String.fromCharCode overflows the call stack on large files.
  const bytes = new Uint8Array(await file.arrayBuffer());
  const CHUNK_SIZE = 0x8000;  // 32K bytes per String.fromCharCode call
  let binary = '';
  for (let i = 0; i < bytes.length; i += CHUNK_SIZE) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK_SIZE));
  }
  const base64 = btoa(binary);
  
  // Upload to Graphlit
  const result = await graphlit.ingestEncodedFile(
    file.name,
    base64,
    file.type,
    file.name
  );
  
  return result.ingestEncodedFile.id;
}

// In your component
<input
  type="file"
  onChange={async (e) => {
    const file = e.target.files?.[0];
    if (file) {
      const contentId = await handleFileUpload(file);
      console.log('Uploaded:', contentId);
    }
  }}
/>

Use cases:

  • User file uploads
  • Document management systems
  • Local file processing

Method 3: Ingest Text (Plain Text/Markdown)

Ingest raw text or markdown without a file:

// Ingest plain text
const text = await graphlit.ingestText(
  'This is my note about quantum computing...',
  'Quantum Computing Notes',
  false  // isMarkdown
);

// Ingest markdown
const markdown = await graphlit.ingestText(
  '# Project Plan\n\n## Phase 1\n- Task 1\n- Task 2',
  'Project Plan',
  true  // isMarkdown
);

console.log('Text content ID:', text.ingestText.id);

Use cases:

  • Note-taking apps
  • User-generated content
  • Chat messages
  • Code snippets
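When ingesting user-generated text, you often don't know up front whether to pass `true` or `false` for isMarkdown. A simple heuristic sketch (the patterns are assumptions; tune them for your own content):

```typescript
// Hypothetical heuristic for the isMarkdown flag: treat text as markdown
// if it contains common markdown structures.
function looksLikeMarkdown(text: string): boolean {
  return [
    /^#{1,6}\s/m,           // ATX headings
    /^\s*[-*]\s+\S/m,       // bullet lists
    /\[[^\]]+\]\([^)]+\)/,  // inline links
    /```/,                  // fenced code blocks
  ].some(pattern => pattern.test(text));
}
```

Usage: `await graphlit.ingestText(text, name, looksLikeMarkdown(text))`.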

Part 2: Content Types vs File Types

Understanding the difference is critical for filtering and metadata access.

Content Types (Primary Classification)

What it represents (semantic meaning):

import { ContentTypes } from 'graphlit-client/dist/generated/graphql-types';

// Content types:
ContentTypes.Email      // Email messages
ContentTypes.Message    // Slack, Teams, Discord messages
ContentTypes.Page       // Web pages
ContentTypes.File       // Physical files (PDFs, images, videos, etc.)
ContentTypes.Post       // Social media posts, RSS
ContentTypes.Event      // Calendar events
ContentTypes.Issue      // Jira, Linear, GitHub issues
ContentTypes.Text       // Plain text, markdown

File Types (Secondary Classification)

Physical format (only when contentType = FILE):

import { FileTypes } from 'graphlit-client/dist/generated/graphql-types';

// File types:
FileTypes.Document  // PDF, DOCX, XLSX, PPTX
FileTypes.Image     // JPEG, PNG, GIF, TIFF
FileTypes.Audio     // MP3, WAV, M4A
FileTypes.Video     // MP4, MOV, AVI
FileTypes.Code      // Python, JavaScript, etc.
FileTypes.Data      // JSON, XML, CSV
FileTypes.Package   // ZIP, TAR

The Hierarchy

Content
  └─ type: ContentTypes (always present)
       └─ fileType: FileTypes (only if type = FILE)

Examples:

// PDF document
content.type = ContentTypes.File
content.fileType = FileTypes.Document

// Email with PDF attachment
content.type = ContentTypes.Email
content.fileType = null  // Emails don't have fileType

// Slack message with image
content.type = ContentTypes.Message
content.fileType = FileTypes.Image  // The image attached

// Web page
content.type = ContentTypes.Page
content.fileType = null
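For logging or UI display, the two-level hierarchy collapses naturally into a single label. A sketch that works on the raw enum strings (the helper itself is illustrative, not part of the SDK):

```typescript
// Hypothetical helper: human-readable label from the type hierarchy.
// fileType is set for FILE content and for content carrying a file
// attachment (e.g., a MESSAGE with an image); otherwise it is null.
function contentLabel(type: string, fileType?: string | null): string {
  return fileType ? `${type}/${fileType}` : type;
}
```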

Filtering by Type

// Get all emails
const emails = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Email]
  }
});

// Get all PDF documents
const pdfs = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.File],
    fileTypes: [FileTypes.Document]
  }
});

// Get all images (from any source)
const images = await graphlit.queryContents({
  filter: {
    fileTypes: [FileTypes.Image]
  }
});

Type-Specific Metadata

Each content type has a metadata field:

const content = await graphlit.getContent(contentId);

switch (content.content.type) {
  case ContentTypes.Email:
    // Access email metadata
    console.log('From:', content.content.email.from[0].email);
    console.log('Subject:', content.content.email.subject);
    console.log('Sent:', content.content.email.sentDateTime);
    break;
    
  case ContentTypes.Message:
    // Access message metadata
    console.log('Channel:', content.content.message.channelName);
    console.log('Author:', content.content.message.author);
    console.log('Platform:', content.content.message.platform);  // SLACK, TEAMS, etc.
    break;
    
  case ContentTypes.File:
    if (content.content.fileType === FileTypes.Document) {
      // Access document metadata
      console.log('Pages:', content.content.document.pageCount);
      console.log('Author:', content.content.document.author);
      console.log('Title:', content.content.document.title);
    }
    break;
    
  case ContentTypes.Event:
    // Access calendar event metadata
    console.log('Start:', content.content.event.startDateTime);
    console.log('Attendees:', content.content.event.attendees.length);
    break;
}

Part 3: Content Lifecycle States

Content goes through states during processing:

The State Machine

CREATED → PROCESSING → INDEXED
                ↓
            (if error) → FAILED

States:

  • CREATED: Just ingested, waiting to process
  • PROCESSING: Extracting text, generating embeddings, running workflows
  • INDEXED: Fully processed, searchable
  • FAILED: Processing failed (check error message)
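The state machine above can be encoded directly, which is handy for validating webhook events or polling logic. A sketch over the raw state strings (the transition table mirrors the diagram; it is an assumption that no other transitions occur):

```typescript
// Legal transitions from the lifecycle diagram. INDEXED and FAILED are
// terminal: they have no outgoing transitions.
const TRANSITIONS: Record<string, string[]> = {
  CREATED: ['PROCESSING'],
  PROCESSING: ['INDEXED', 'FAILED'],
  INDEXED: [],
  FAILED: [],
};

function canTransition(from: string, to: string): boolean {
  return (TRANSITIONS[from] ?? []).includes(to);
}
```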

Checking State

const content = await graphlit.getContent(contentId);
console.log('State:', content.content.state);

// States from enum
import { EntityState } from 'graphlit-client/dist/generated/graphql-types';

if (content.content.state === EntityState.Indexed) {
  console.log('✓ Content is ready for search');
}

Polling for Completion

// Method 1: Simple polling
async function waitForContent(contentId: string) {
  let isDone = false;
  
  while (!isDone) {
    const status = await graphlit.isContentDone(contentId);
    isDone = status.isContentDone.result;
    
    if (!isDone) {
      console.log('Still processing...');
      await new Promise(resolve => setTimeout(resolve, 2000));  // 2s delay
    }
  }
  
  console.log('✓ Content ready!');
}

// Usage
const content = await graphlit.ingestUri('https://example.com/doc.pdf');
await waitForContent(content.ingestUri.id);

With timeout:

async function waitForContentWithTimeout(contentId: string, timeoutMs = 60000) {
  const startTime = Date.now();
  
  while (Date.now() - startTime < timeoutMs) {
    const status = await graphlit.isContentDone(contentId);
    
    if (status.isContentDone.result) {
      return true;  // Success
    }
    
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
  
  throw new Error('Timeout waiting for content');
}
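The two variants above generalize to one helper that polls any async predicate with a fixed interval and a deadline. A sketch (the helper name and defaults are choices, not SDK API):

```typescript
// Generic poller: resolves true once check() returns true; throws if the
// deadline passes first.
async function pollUntil(
  check: () => Promise<boolean>,
  intervalMs = 2000,
  timeoutMs = 60000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return true;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out after ${timeoutMs}ms`);
}
```

With the SDK it would be used as `await pollUntil(async () => (await graphlit.isContentDone(contentId)).isContentDone.result)`.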

Webhooks (Production Pattern)

Don't poll in production—use webhooks:

// When creating a feed or workflow, specify webhook URL
const feed = await graphlit.createFeed(
  FeedServiceTypes.Rss,
  workflowId,
  { uri: 'https://example.com/rss' },
  'My Feed',
  undefined,
  'https://yourapp.com/webhooks/graphlit'  // Your webhook endpoint
);

// Your webhook handler (Express.js)
app.post('/webhooks/graphlit', async (req, res) => {
  const event = req.body;
  
  if (event.type === 'content.done') {
    console.log('Content ready:', event.contentId);
    
    // Fetch full content
    const content = await graphlit.getContent(event.contentId);
    
    // Process it (index in your DB, trigger notifications, etc.)
    await processContent(content);
  }
  
  res.sendStatus(200);
});

Part 4: Batch Operations

Batch Ingestion

// Ingest multiple URLs
const urls = [
  'https://example.com/doc1.pdf',
  'https://example.com/doc2.pdf',
  'https://example.com/doc3.pdf'
];

const contentIds: string[] = [];

for (const url of urls) {
  const result = await graphlit.ingestUri(url);
  contentIds.push(result.ingestUri.id);
  console.log(`Ingested: ${url}`);
}

// Wait for all to complete
await Promise.all(
  contentIds.map(id => waitForContent(id))
);

console.log('✓ All content indexed');

Parallel ingestion:

// Ingest in parallel (faster)
const results = await Promise.all(
  urls.map(url => graphlit.ingestUri(url))
);

const contentIds = results.map(r => r.ingestUri.id);
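An unbounded Promise.all over hundreds of URLs can trip rate limits. A small concurrency pool keeps at most N ingestions in flight while preserving input order (a sketch; the limit of 3 is an arbitrary assumption):

```typescript
// Run tasks with at most `limit` in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;  // claim an index synchronously, then await the task
      results[i] = await task(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

With the SDK this would look like `await mapWithConcurrency(urls, 3, url => graphlit.ingestUri(url).then(r => r.ingestUri.id))`.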

Batch Deletion

import { EntityState } from 'graphlit-client/dist/generated/graphql-types';

// Get all content
const allContent = await graphlit.queryContents({
  filter: {
    states: [EntityState.Indexed, EntityState.Failed]
  }
});

// Delete in batches
const contentIds = allContent.contents.results.map(c => c.id);

const deleted = await graphlit.deleteContents(contentIds);
console.log(`Deleted ${deleted.deleteContents.count} content items`);

Delete with filter:

// Delete all PDFs created more than 30 days ago
const oldPdfs = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.File],
    fileTypes: [FileTypes.Document],
    creationDateRange: {
      to: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString()
    }
  }
});

const ids = oldPdfs.contents.results.map(c => c.id);
await graphlit.deleteContents(ids);
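Very large deletes may need to be split into bounded batches rather than passed to deleteContents in one call. A chunking sketch (the batch size of 100 is an assumption; check your plan's limits):

```typescript
// Split an array of IDs into fixed-size chunks for batched delete calls.
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```

Then delete batch by batch: `for (const batch of chunk(ids, 100)) { await graphlit.deleteContents(batch); }`.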

Part 5: Advanced Patterns

Pattern 1: Ingest with Workflow

Apply processing workflows during ingestion:

// Create workflow first (see Workflows guide)
const workflow = await graphlit.createWorkflow({
  name: "Extract Entities",
  preparation: { /* ... */ },
  extraction: { /* ... */ }
});

// Ingest with workflow
const content = await graphlit.ingestUri(
  'https://example.com/doc.pdf',
  'Document',
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id }  // Apply workflow
);

// Content will be processed with entity extraction

Pattern 2: Update Content Metadata

// Update content name or metadata
await graphlit.updateContent(contentId, {
  name: 'Updated Name'
});

// Query to verify
const updated = await graphlit.getContent(contentId);
console.log('New name:', updated.content.name);

Pattern 3: Re-Ingest Content

// To re-process content with a different workflow:

// 1. Delete old content
await graphlit.deleteContent(oldContentId);

// 2. Re-ingest with new workflow
const newContent = await graphlit.ingestUri(
  sameUrl,
  'Same Document',
  undefined,
  undefined,
  undefined,
  { id: newWorkflowId }
);

Pattern 4: Content Collections

Organize content into collections:

// Create collection
const collection = await graphlit.createCollection('Research Papers');
const collectionId = collection.createCollection.id;

// Ingest into collection
const content = await graphlit.ingestUri(
  'https://example.com/paper.pdf',
  'Research Paper',
  undefined,
  undefined,
  [{ id: collectionId }]  // Add to collection
);

// Query collection
const papers = await graphlit.queryContents({
  filter: {
    collections: [{ id: collectionId }]
  }
});

Part 6: Error Handling

Handling Failed Ingestion

let contentId: string | undefined;

try {
  const content = await graphlit.ingestUri('https://example.com/doc.pdf');
  contentId = content.ingestUri.id;
  await waitForContent(contentId);
} catch (error: any) {
  console.error('Ingestion failed:', error.message);
  
  // Check whether the content reached the FAILED state
  if (contentId) {
    const content = await graphlit.getContent(contentId);
    
    if (content.content.state === EntityState.Failed) {
      console.log('Error details:', content.content.error);
      
      // Retry with different settings or workflow
      const retry = await graphlit.ingestUri(
        'https://example.com/doc.pdf',
        'Retry',
        undefined,
        undefined,
        undefined,
        { id: alternativeWorkflowId }
      );
    }
  }
}

Retry with Exponential Backoff

async function ingestWithRetry(uri: string, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const content = await graphlit.ingestUri(uri);
      await waitForContent(content.ingestUri.id);
      return content.ingestUri.id;
    } catch (error) {
      if (attempt === maxRetries) throw error;
      
      const delay = Math.pow(2, attempt) * 1000;
      console.log(`Attempt ${attempt} failed, retrying in ${delay}ms...`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
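When many clients retry at once, a fixed exponential schedule makes them all hit the API again at the same moments. Adding full jitter spreads the retries out. A sketch of the delay computation (the base and cap values are assumptions):

```typescript
// Exponential backoff with full jitter: the delay ceiling grows as
// 2^attempt * base, capped at capMs, and a uniform random fraction of
// that ceiling is actually waited.
function backoffDelay(attempt: number, baseMs = 1000, capMs = 30000): number {
  const ceiling = Math.min(capMs, Math.pow(2, attempt) * baseMs);
  return Math.floor(Math.random() * ceiling);
}
```

Dropping this into ingestWithRetry means replacing the `Math.pow(2, attempt) * 1000` line with `backoffDelay(attempt)`.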

Common Issues & Solutions

Issue: Content Stuck in PROCESSING

Problem: Content never reaches INDEXED state.

Solutions:

  1. Check for a timeout:

const content = await graphlit.getContent(contentId);
if (content.content.state === EntityState.Processing) {
  console.log('Still processing after 5 minutes - may be stuck');
  // Contact support or delete and re-ingest
}

  2. Try a simpler workflow:

// Ingest without workflow
const simple = await graphlit.ingestUri(uri);

Issue: Large Files Timeout

Problem: Uploading large files (>100MB) fails.

Solution: Use URI ingestion with pre-signed URLs:

// 1. Upload file to S3/Azure with your own bucket
const s3Url = await uploadToS3(file);

// 2. Ingest from S3
const content = await graphlit.ingestUri(s3Url);

Issue: Extracted Text is Gibberish

Problem: PDF text extraction produces nonsense.

Solution: Use vision-based extraction:

import { FilePreparationServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Create workflow with vision model
const workflow = await graphlit.createWorkflow({
  name: "Vision Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // Uses GPT-4 Vision
      }
    }]
  }
});

// Ingest with vision workflow
const content = await graphlit.ingestUri(
  uri,
  undefined,
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id }
);

What's Next?

That covers the core of content ingestion. Next steps:

  1. Set up data connectors to automatically ingest from Slack, Gmail, etc.
  2. Create workflows to customize processing
  3. Use collections to organize content
  4. Implement webhooks for production monitoring



Complete Example: Production Ingestion Pipeline

import { Graphlit } from 'graphlit-client';
import { ContentTypes, FileTypes, EntityState } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

interface IngestionResult {
  contentId: string;
  url: string;
  state: string;
  duration: number;
  error?: string;
}

async function productionIngest(urls: string[], workflowId?: string): Promise<IngestionResult[]> {
  const results: IngestionResult[] = [];
  
  // Ingest all in parallel
  const ingestions = await Promise.allSettled(
    urls.map(url => graphlit.ingestUri(
      url,
      undefined,
      undefined,
      undefined,
      undefined,
      workflowId ? { id: workflowId } : undefined
    ))
  );
  
  // Track successful ingestions
  const contentIds: Array<{ url: string; contentId: string; startTime: number }> = [];
  
  ingestions.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      contentIds.push({
        url: urls[i],
        contentId: result.value.ingestUri.id,
        startTime: Date.now()
      });
    } else {
      results.push({
        contentId: '',
        url: urls[i],
        state: 'FAILED',
        duration: 0,
        error: result.reason.message
      });
    }
  });
  
  // Wait for all to complete (with timeout)
  await Promise.all(
    contentIds.map(async ({ url, contentId, startTime }) => {
      try {
        await waitForContentWithTimeout(contentId, 300000);  // 5min timeout
        
        const content = await graphlit.getContent(contentId);
        results.push({
          contentId,
          url,
          state: content.content.state,
          duration: Date.now() - startTime
        });
      } catch (error: any) {
        results.push({
          contentId,
          url,
          state: 'TIMEOUT',
          duration: Date.now() - startTime,
          error: error.message
        });
      }
    })
  );
  
  return results;
}

// Usage
const urls = [
  'https://example.com/doc1.pdf',
  'https://example.com/doc2.pdf',
  'https://example.com/doc3.pdf'
];

const results = await productionIngest(urls, workflowId);

console.log('Ingestion Results:');
results.forEach(r => {
  console.log(`${r.url}: ${r.state} (${r.duration}ms)${r.error ? ` - ${r.error}` : ''}`);
});

const successful = results.filter(r => r.state === EntityState.Indexed).length;
console.log(`\n${successful}/${urls.length} successful`);

Happy ingesting! 📥

Ready to Build with Graphlit?

Start building AI-powered applications with our API-first platform. The free tier includes 100 credits/month, no credit card required, and it takes about 5 minutes to your first API call.
