Core Platform | 25 min read

Building Knowledge Graphs: From Zero to Production

Learn how to extract entities from documents, build knowledge graphs, and query relationships. Complete guide covering PDF extraction, the Observable model, entity types, and production patterns.

Knowledge graphs transform unstructured content—PDFs, emails, meeting transcripts, Slack messages—into queryable networks of entities and relationships. Instead of searching for keywords, you can ask "Who works at which company?" or "What topics were discussed in meetings about Project Apollo?"

This guide takes you from your first entity extraction to production knowledge graphs serving real applications. We'll cover the Observable/Observation model, entity types, extraction workflows, and advanced querying patterns—with complete code examples.

What You'll Build

By the end of this guide, you'll know how to:

  • Extract entities (people, companies, places) from any content type
  • Configure extraction workflows for your use case
  • Understand the Observable/Observation architecture
  • Query entities and relationships
  • Build entity-filtered search and RAG systems
  • Handle deduplication, confidence scores, and edge cases
  • Scale to production with multi-content knowledge graphs

Prerequisites:

  • A Graphlit project (free tier works) - Sign up (2 min)
  • SDK installed: npm install graphlit-client (30 sec)
  • Some content to process (we'll use a public PDF)

Time to complete: 90 minutes
Difficulty: Intermediate

Developer Note: All Graphlit IDs are GUIDs (e.g., 550e8400-e29b-41d4-a716-446655440000). In code examples below, we use short placeholders like content-123 for readability where they're variables that would be populated at runtime. Example outputs show realistic GUID format.


Table of Contents

  1. Your First Knowledge Graph (15 min)
  2. Understanding the Observable/Observation Model
  3. Entity Types and Extraction Strategies
  4. Querying Your Knowledge Graph
  5. Different Content Types
  6. Production Patterns
  7. Advanced Querying

Part 1: Your First Knowledge Graph (15 minutes)

Let's start with the simplest possible example: extract entities from a single PDF.

Step 1: Create an Extraction Workflow

Workflows tell Graphlit how to process content. For knowledge graphs, you need an extraction stage:

import { Graphlit } from 'graphlit-client';
import {
  FilePreparationServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Create workflow with entity extraction
const workflow = await graphlit.createWorkflow({
  name: "PDF Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // Uses vision model for PDFs
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

console.log(`✓ Workflow created: ${workflow.createWorkflow.id}`);

What's happening:

  • preparation stage: Vision model extracts text from PDF pages (handles scans, tables, multi-column layouts)
  • extraction stage: LLM reads extracted text and identifies entities
  • extractedTypes: Only extract these 4 entity types (there are 12 medical types and 7 general types available)

💡 Pro Tip: Use FilePreparationServiceTypes.ModelDocument (vision model) for scanned PDFs—3x more accurate than OCR but costs 2x more. For text-based PDFs, use FilePreparationServiceTypes.Document (OCR) to save costs.

Step 2: Ingest Content with the Workflow

// Ingest a research paper
const content = await graphlit.ingestUri(
  'https://arxiv.org/pdf/2301.00001.pdf',
  "AI Research Paper",
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id }  // Use our extraction workflow
);

console.log(`✓ Ingesting: ${content.ingestUri.id}`);

// Wait for processing to complete
let isDone = false;
while (!isDone) {
  const status = await graphlit.isContentDone(content.ingestUri.id);
  isDone = status.isContentDone.result;
  
  if (!isDone) {
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
}

console.log('✓ Extraction complete!');

Developer hint: isContentDone polls processing status. For production, use webhooks instead.

⚠️ Warning: Polling in tight loops will hit rate limits. Add 2-5 second delays between checks, or better yet, use webhooks for production apps.
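The delay advice above can be folded into a small helper. This is a minimal sketch, not part of the SDK: backoffSchedule and pollUntilDone are hypothetical names, and the 2-second start, 5-second cap, 1.5x growth factor, and 20-second budget are illustrative values rather than documented rate limits.

```typescript
// Build a list of wait times: start at initialMs, grow by `factor`,
// cap each wait at maxMs, and stop before the total exceeds budgetMs.
function backoffSchedule(
  initialMs: number,
  maxMs: number,
  budgetMs: number,
  factor: number = 1.5
): number[] {
  if (initialMs <= 0) {
    throw new Error('initialMs must be positive');
  }
  const delays: number[] = [];
  let next = initialMs;
  let total = 0;
  while (total + next <= budgetMs) {
    delays.push(next);
    total += next;
    next = Math.min(Math.round(next * factor), maxMs);
  }
  return delays;
}

// Call `isDone` until it returns true or the schedule runs out.
// Resolves to the final status, so callers can detect a timeout.
async function pollUntilDone(
  isDone: () => Promise<boolean>,
  delays: number[]
): Promise<boolean> {
  for (const delay of delays) {
    if (await isDone()) return true;
    await new Promise<void>(resolve => setTimeout(resolve, delay));
  }
  return isDone();
}

const schedule = backoffSchedule(2000, 5000, 20000);
console.log(schedule); // [2000, 3000, 4500, 5000, 5000]

// Possible usage with the client from Step 2 (shown as a comment because it needs a live project):
// const finished = await pollUntilDone(
//   async () => (await graphlit.isContentDone(content.ingestUri.id)).isContentDone.result ?? false,
//   schedule
// );
```

Returning false instead of looping forever lets the caller decide what a timeout means, for example logging the content ID and retrying later.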

Step 3: Retrieve Extracted Entities

// Get content with observations (entity mentions)
const contentDetails = await graphlit.getContent(content.ingestUri.id);
const observations = contentDetails.content.observations || [];

console.log(`Found ${observations.length} entity observations`);

// Group by entity type
const byType = new Map<string, Set<string>>();

observations.forEach(obs => {
  if (!byType.has(obs.type)) {
    byType.set(obs.type, new Set());
  }
  byType.get(obs.type)!.add(obs.observable.name);
});

// Display results
byType.forEach((entities, type) => {
  console.log(`\n${type} (${entities.size} unique):`);
  Array.from(entities).slice(0, 5).forEach(name => {
    console.log(`  - ${name}`);
  });
});

Example output:

PERSON (23 unique):
  - Geoffrey Hinton
  - Yann LeCun
  - Yoshua Bengio
  - Andrew Ng
  - Fei-Fei Li

ORGANIZATION (12 unique):
  - Google
  - OpenAI
  - Stanford University
  - MIT
  - DeepMind

🎉 Congratulations! You just built your first knowledge graph. But this is just the beginning—let's understand what's actually happening under the hood.


Part 2: Understanding the Observable/Observation Model

This is the most important concept in Graphlit knowledge graphs. Get this, and everything else makes sense.

The Two-Tier Architecture

Observable = The entity itself (e.g., the person "Geoffrey Hinton")
Observation = A specific mention of that entity in content

Why two layers?

  1. Deduplication: "Geoffrey Hinton" mentioned 50 times across 10 PDFs = 1 Observable, 50 Observations
  2. Provenance: Track exactly where each mention appears (page 3, paragraph 2, coordinates)
  3. Confidence: Each mention has its own confidence score
  4. Relationships: Find co-occurrences (who was mentioned with whom, on which pages)

Data Flow Diagram

PDF Ingestion
    ↓
Vision Model Extracts Text (Preparation)
    ↓
LLM Identifies Entities (Extraction)
    ↓
For Each Entity Mention:
  ├─ Create Observation
  │  ├─ Type (PERSON, ORGANIZATION, etc.)
  │  ├─ Confidence (0.0-1.0)
  │  ├─ Page number & coordinates
  │  └─ Text context
  ↓
Entity Resolution (Automatic Deduplication)
  ├─ Is this entity already in the graph?
  ├─ Match by name, properties
  └─ Create new Observable OR link to existing
  ↓
Knowledge Graph Updated
  └─ Observable now has N observations
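To make the flow above concrete, here is a toy in-memory version of the two tiers. It is a conceptual sketch only: ToyGraph, Mention, and ToyObservable are hypothetical names, the field names are not the Graphlit schema, and matching on a normalized name is a stand-in for whatever entity resolution the platform actually performs.

```typescript
// One entity mention found in a piece of content.
interface Mention {
  contentId: string;
  name: string;       // entity name as it appeared in the text
  type: string;       // e.g. "PERSON"
  pageIndex: number;
  confidence: number; // 0.0-1.0
}

// The entity itself, with every mention linked to it.
interface ToyObservable {
  id: string;
  name: string;
  type: string;
  observations: Mention[];
}

class ToyGraph {
  private byKey = new Map<string, ToyObservable>();

  // Resolution key: entity type plus a case- and whitespace-normalized name.
  private static key(type: string, name: string): string {
    return `${type}:${name.trim().toLowerCase().replace(/\s+/g, ' ')}`;
  }

  // Record one mention: link it to an existing observable or create a new one.
  addMention(mention: Mention): ToyObservable {
    const key = ToyGraph.key(mention.type, mention.name);
    let observable = this.byKey.get(key);
    if (!observable) {
      observable = {
        id: `observable-${this.byKey.size + 1}`,
        name: mention.name.trim(),
        type: mention.type,
        observations: []
      };
      this.byKey.set(key, observable);
    }
    observable.observations.push(mention);
    return observable;
  }

  observables(): ToyObservable[] {
    return Array.from(this.byKey.values());
  }
}

const toyGraph = new ToyGraph();
toyGraph.addMention({ contentId: 'content-1', name: 'Geoffrey Hinton', type: 'PERSON', pageIndex: 3, confidence: 0.95 });
toyGraph.addMention({ contentId: 'content-2', name: 'geoffrey  hinton', type: 'PERSON', pageIndex: 1, confidence: 0.81 });
toyGraph.addMention({ contentId: 'content-2', name: 'Google', type: 'ORGANIZATION', pageIndex: 1, confidence: 0.9 });

toyGraph.observables().forEach(o => {
  console.log(`${o.type} ${o.name}: ${o.observations.length} observation(s)`);
});
```

The two Hinton mentions come from different content items and differ in casing and spacing, yet they resolve to a single observable with two observations, each keeping its own page and confidence.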

Code Example: Observations vs Observables

// Get observations from a single content item
const content = await graphlit.getContent('content-123');

content.content.observations?.forEach(observation => {
  console.log(`\nObservation ID: ${observation.id}`);
  console.log(`  Entity Type: ${observation.type}`);
  console.log(`  Entity Name: ${observation.observable.name}`);
  console.log(`  Observable ID: ${observation.observable.id}`);  // The entity itself
  
  // Where was this entity mentioned?
  observation.occurrences?.forEach(occ => {
    console.log(`    Page ${occ.pageIndex}, confidence: ${occ.confidence}`);
    if (occ.boundingBox) {
      console.log(`    Location: (${occ.boundingBox.left}, ${occ.boundingBox.top})`);
    }
  });
});

Example output:

Observation ID: 8c3e2f1a-4b5d-4e7f-8a9b-0c1d2e3f4a5b
  Entity Type: PERSON
  Entity Name: Geoffrey Hinton
  Observable ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890  ← The actual entity
    Page 3, confidence: 0.95
    Location: (120, 450)
    Page 12, confidence: 0.89
    Location: (340, 200)

Now query the Observable (entity) directly:

// Get the entity itself with all its mentions across ALL content
const entityResult = await graphlit.queryObservables({
  observables: [{ id: 'a1b2c3d4-e5f6-7890-abcd-ef1234567890' }]
});

const entity = entityResult.observables?.results?.[0];

console.log(`Entity: ${entity?.observable.name}`);
console.log(`Type: ${entity?.type}`);
console.log(`Mentioned in ${entity?.observable.observationCount} places`);

Key insight: Observations are scoped to single content items. Observables span your entire knowledge graph.

✅ Quick Win: Once you understand this model, you can build entity-filtered search and RAG chatbots that answer questions like "What did Alice say about Project Phoenix?"


Part 3: Entity Types and Extraction Strategies

Graphlit supports 19 built-in entity types across two categories:

General Entity Types (7)

import { ObservableTypes } from 'graphlit-client/dist/generated/graphql-types';

const generalTypes = [
  ObservableTypes.Person,           // People, authors, speakers
  ObservableTypes.Organization,     // Companies, institutions, agencies
  ObservableTypes.Place,            // Cities, countries, addresses
  ObservableTypes.Event,            // Meetings, conferences, incidents
  ObservableTypes.Product,          // Software, devices, offerings
  ObservableTypes.CreativeWork,     // Books, papers, articles
  ObservableTypes.Other             // Catch-all for domain-specific entities
];

Medical Entity Types (12)

For healthcare, research, and clinical applications:

const medicalTypes = [
  ObservableTypes.MedicalCondition,      // Diseases, symptoms, diagnoses
  ObservableTypes.MedicalProcedure,      // Surgeries, treatments, therapies
  ObservableTypes.MedicalTest,           // Labs, imaging, diagnostics
  ObservableTypes.MedicalTreatment,      // Drugs, protocols, interventions
  ObservableTypes.MedicalAnatomy,        // Organs, body parts, systems
  ObservableTypes.MedicalDevice,         // Equipment, implants, instruments
  ObservableTypes.MedicalGuideline,      // Protocols, standards, best practices
  ObservableTypes.MedicalStudy,          // Clinical trials, research papers
  ObservableTypes.MedicalMeasurement,    // Vital signs, lab values, metrics
  ObservableTypes.MedicalCode,           // ICD-10, CPT codes
  ObservableTypes.MedicalQuality,        // Severity descriptors
  ObservableTypes.MedicalDrug            // Pharmaceuticals, medications
];

Choosing Entity Types: Decision Matrix

Use Case | Entity Types | Why
Corporate docs (emails, reports) | Person, Organization, Place, Event | Track who works where, meeting attendees
Research papers | Person, Organization, CreativeWork | Authors, citations, institutions
Customer support | Person, Organization, Product, Event | Customer names, companies, products discussed, incidents
Medical records | All medical types + Person, Place | Patient data, treatments, locations
Legal documents | Person, Organization, Place, Event | Parties, companies, jurisdictions, dates
News/Social media | Person, Organization, Place, Event, Product | Who, what, where, when
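The matrix above can also live in code as a lookup, so that content spanning several use cases gets the union of their entity types. This is a sketch under assumptions: UseCase, entityTypesByUseCase, and entityTypesFor are hypothetical names, the type names are plain strings standing in for ObservableTypes members, and the medical row is left out for brevity.

```typescript
type UseCase = 'corporate' | 'research' | 'support' | 'legal' | 'news';

// Rows of the decision matrix, keyed by use case.
const entityTypesByUseCase: Record<UseCase, string[]> = {
  corporate: ['Person', 'Organization', 'Place', 'Event'],
  research: ['Person', 'Organization', 'CreativeWork'],
  support: ['Person', 'Organization', 'Product', 'Event'],
  legal: ['Person', 'Organization', 'Place', 'Event'],
  news: ['Person', 'Organization', 'Place', 'Event', 'Product']
};

// Union of entity types for the given use cases, in first-seen order, without duplicates.
function entityTypesFor(useCases: UseCase[]): string[] {
  const merged = new Set<string>();
  useCases.forEach(useCase => {
    entityTypesByUseCase[useCase].forEach(typeName => merged.add(typeName));
  });
  return Array.from(merged);
}

console.log(entityTypesFor(['research', 'support']));
// [ 'Person', 'Organization', 'CreativeWork', 'Product', 'Event' ]
```

Keeping the mapping in one place makes it easier to see which workflows need updating when a new entity type becomes relevant.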

Configuring Multi-Type Extraction

// Comprehensive entity extraction workflow
const comprehensiveWorkflow = await graphlit.createWorkflow({
  name: "Comprehensive Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument
      }
    }]
  },
  extraction: {
    jobs: [
      // First pass: General entities
      {
        connector: {
          type: EntityExtractionServiceTypes.ModelDocument,
          extractedTypes: [
            ObservableTypes.Person,
            ObservableTypes.Organization,
            ObservableTypes.Place,
            ObservableTypes.Event,
            ObservableTypes.Product,
            ObservableTypes.CreativeWork
          ]
        }
      },
      // Second pass: Medical entities (if applicable)
      {
        connector: {
          type: EntityExtractionServiceTypes.ModelDocument,
          extractedTypes: [
            ObservableTypes.MedicalCondition,
            ObservableTypes.MedicalProcedure,
            ObservableTypes.MedicalTreatment,
            ObservableTypes.MedicalDrug
          ]
        }
      }
    ]
  }
});

Developer hint: Multiple extraction jobs run in parallel. Use this for medical + general extraction or different models per entity type.


Part 4: Querying Your Knowledge Graph

Now that you have entities, let's query them like a graph database.

Basic Queries: Get All Entities of a Type

import { EntityState } from 'graphlit-client/dist/generated/graphql-types';

// Get all people in your knowledge graph
const people = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]  // Exclude disabled/deleted entities
  }
});

console.log(`Total people: ${people.observables?.results?.length}`);

people.observables?.results?.forEach(result => {
  const person = result.observable;
  console.log(`- ${person.name} (mentioned ${person.observationCount} times)`);
});

Filtering by Name (Search Entities)

// Find all organizations with "Research" in the name
const researchOrgs = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Organization],
    searchText: "Research",  // Fuzzy name matching
    states: [EntityState.Enabled]
  }
});

researchOrgs.observables?.results?.forEach(result => {
  console.log(result.observable.name);
});
// Output: "OpenAI Research", "Google Research", "MIT Research Lab", etc.

Advanced: Find Entity Relationships (Co-occurrence)

Entities that appear on the same page are likely related. Let's find person-organization relationships:

// Get content with observations
const content = await graphlit.getContent('content-123');
const observations = content.content.observations || [];

// Build co-occurrence matrix
const relationships: Array<{
  person: string;
  organization: string;
  pages: number[];
}> = [];

observations
  .filter(obs => obs.type === ObservableTypes.Person)
  .forEach(personObs => {
    const personPages = new Set(
      personObs.occurrences?.map(occ => occ.pageIndex) || []
    );
    
    observations
      .filter(obs => obs.type === ObservableTypes.Organization)
      .forEach(orgObs => {
        const orgPages = new Set(
          orgObs.occurrences?.map(occ => occ.pageIndex) || []
        );
        
        // Find shared pages
        const sharedPages = Array.from(personPages).filter(p => orgPages.has(p));
        
        if (sharedPages.length > 0) {
          relationships.push({
            person: personObs.observable.name,
            organization: orgObs.observable.name,
            pages: sharedPages
          });
        }
      });
  });

// Display top relationships
relationships
  .sort((a, b) => b.pages.length - a.pages.length)
  .slice(0, 10)
  .forEach(rel => {
    console.log(`${rel.person} ↔ ${rel.organization}`);
    console.log(`  Co-occurs on ${rel.pages.length} pages: ${rel.pages.join(', ')}`);
  });

Example output:

Geoffrey Hinton ↔ Google
  Co-occurs on 8 pages: 3, 7, 12, 15, 18, 23, 29, 31

Yann LeCun ↔ Meta
  Co-occurs on 5 pages: 4, 9, 14, 20, 27
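Raw shared-page counts favor entities that appear on many pages. One optional refinement is to normalize by how often each entity appears overall. The sketch below swaps in Jaccard similarity, a standard set measure that is not part of the Graphlit API; jaccard is a hypothetical helper name.

```typescript
// Jaccard similarity of two page lists: |A ∩ B| / |A ∪ B|, from 0 (no overlap) to 1 (identical).
function jaccard(pagesA: number[], pagesB: number[]): number {
  const a = new Set(pagesA);
  const b = new Set(pagesB);
  let shared = 0;
  a.forEach(page => {
    if (b.has(page)) shared++;
  });
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// A person on pages 3, 7, 12 and an organization on pages 3, 7, 12, 15
// share 3 of 4 distinct pages.
console.log(jaccard([3, 7, 12], [3, 7, 12, 15])); // 0.75
```

Sorting relationships by this score instead of by pages.length surfaces pairs that almost always appear together, even when both are mentioned only a few times.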

Entity-Filtered Search (RAG)

Use entities to filter search results—find content mentioning specific people or companies:

// Search for content mentioning both "Geoffrey Hinton" AND "neural networks"
const searchResults = await graphlit.searchContents(
  'neural networks',
  {
    filters: [
      {
        observations: {
          observables: [
            { id: 'entity-person-123' }  // Geoffrey Hinton's entity ID
          ]
        }
      }
    ]
  }
);

searchResults.results?.forEach(result => {
  console.log(`${result.name} - Score: ${result.score}`);
});

Use case: "Show me all emails where Alice mentioned Project Phoenix" or "Find meeting transcripts with Bob and the CFO".


Part 5: Building Knowledge Graphs from Different Content Types

The workflow pattern stays the same, but extraction strategies differ by content type.

Emails (Gmail, Outlook)

// Create email-optimized workflow
const emailWorkflow = await graphlit.createWorkflow({
  name: "Email Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Email  // Email-specific parsing
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,  // Emails are pure text
        extractedTypes: [
          ObservableTypes.Person,        // Senders, recipients, mentioned people
          ObservableTypes.Organization,  // Companies, clients
          ObservableTypes.Place,         // Office locations, meeting venues
          ObservableTypes.Event          // Meetings, deadlines
        ]
      }
    }]
  }
});

// Create Gmail feed with workflow
const feed = await graphlit.createFeed(
  FeedServiceTypes.Gmail,
  { id: emailWorkflow.createWorkflow.id },
  { readLimit: 100 },  // Last 100 emails
  'Gmail Entity Extraction'
);

What you get: Automatic contact extraction, company mentions, meeting locations—queryable as a knowledge graph.

Slack Messages

// Slack-optimized workflow
const slackWorkflow = await graphlit.createWorkflow({
  name: "Slack Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Message
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Product,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

// Create Slack feed
const slackFeed = await graphlit.createFeed(
  FeedServiceTypes.Slack,
  { id: slackWorkflow.createWorkflow.id },
  {
    type: FeedTypeTypes.Channel,
    channels: ['general', 'engineering', 'product']
  },
  'Slack Channel Entities'
);

Use case: "Who's talking about which products in Slack?" or "Which customers were mentioned in #support today?"

Meeting Transcripts

// Meeting/audio workflow
const meetingWorkflow = await graphlit.createWorkflow({
  name: "Meeting Transcript Entities",
  preparation: {
    jobs: [
      {
        connector: {
          type: FilePreparationServiceTypes.ModelAudio  // Transcribe audio
        }
      }
    ]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,   // Attendees, mentioned people
          ObservableTypes.Event,    // Mentioned meetings, deadlines
          ObservableTypes.Product,  // Products discussed
          ObservableTypes.Place     // Office locations, cities
        ]
      }
    }]
  }
});

Developer hint: Transcription happens in the preparation stage. Extraction runs on the transcript text.


Part 6: Production Patterns

Pattern 1: Multi-Content Knowledge Graphs

Build a unified knowledge graph across all your content:

// Create workflow once
const productionWorkflow = await graphlit.createWorkflow({
  name: "Production Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

// Ingest multiple content items with same workflow
const contents = [
  'https://company.com/q4-report.pdf',
  'https://company.com/strategy.pdf',
  'https://company.com/org-chart.pdf'
];

for (const url of contents) {
  await graphlit.ingestUri(
    url,
    undefined,
    undefined,
    undefined,
    undefined,
    { id: productionWorkflow.createWorkflow.id }
  );
}

// Now query entities across ALL ingested content
const allPeople = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]
  }
});

console.log(`Total people across all documents: ${allPeople.observables?.results?.length}`);

Key insight: Entities are automatically deduplicated. "Alice Johnson" mentioned in 3 different PDFs = 1 Observable with 3+ Observations.
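The ingestion loop above awaits one URL at a time. For longer lists, a bounded amount of parallelism is one option. The helpers below are a sketch, not SDK features: chunk and inBatches are hypothetical names, and a suitable batch size depends on your project's rate limits, which this guide does not specify.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `task` over all items: batches run one after another,
// items within a batch run in parallel. Results keep input order.
async function inBatches<T, R>(
  items: T[],
  size: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, size)) {
    const batchResults = await Promise.all(batch.map(task));
    batchResults.forEach(result => results.push(result));
  }
  return results;
}

console.log(chunk(['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf'], 2));
// [ [ 'a.pdf', 'b.pdf' ], [ 'c.pdf', 'd.pdf' ], [ 'e.pdf' ] ]

// Possible usage with the workflow above (comment only, since it needs a live project):
// await inBatches(contents, 5, url =>
//   graphlit.ingestUri(url, undefined, undefined, undefined, undefined,
//     { id: productionWorkflow.createWorkflow.id }));
```

Note that Promise.all rejects as soon as one task fails, so a production version would probably catch errors per item and report them separately.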

Pattern 2: Confidence-Based Filtering

Not all entity mentions are equally reliable. Filter by confidence:

// Get high-confidence observations only
const content = await graphlit.getContent('content-123');

// Average confidence across an observation's occurrences (0 when there are none)
const averageConfidence = (
  occurrences?: Array<{ confidence?: number | null } | null> | null
): number => {
  const list = occurrences ?? [];
  if (list.length === 0) return 0;
  return list.reduce((sum, occ) => sum + (occ?.confidence ?? 0), 0) / list.length;
};

const highConfidenceEntities = (content.content.observations ?? [])
  .map(obs => ({
    name: obs.observable.name,
    type: obs.type,
    confidence: averageConfidence(obs.occurrences)
  }))
  .filter(entity => entity.confidence >= 0.8);  // 80%+ confidence

console.log('High-confidence entities:', highConfidenceEntities);

Rule of thumb:

  • 0.9+: Very reliable (use in production UIs)
  • 0.7-0.9: Reliable (good for most use cases)
  • 0.5-0.7: Moderate (review manually)
  • <0.5: Low confidence (likely false positives)
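The rule of thumb above maps directly onto a small classifier, which is handy for routing entities to a review queue. confidenceTier and the tier labels are hypothetical names, and treating each boundary value (0.9, 0.7, 0.5) as belonging to the higher tier is a choice made here, since the ranges above overlap at their edges.

```typescript
type ConfidenceTier = 'very-reliable' | 'reliable' | 'moderate' | 'low';

// Classify a confidence score (0.0-1.0) using the thresholds listed above.
function confidenceTier(confidence: number): ConfidenceTier {
  if (confidence >= 0.9) return 'very-reliable';
  if (confidence >= 0.7) return 'reliable';
  if (confidence >= 0.5) return 'moderate';
  return 'low';
}

[0.95, 0.9, 0.75, 0.5, 0.2].forEach(score => {
  console.log(`${score} -> ${confidenceTier(score)}`);
});
```

An application might show 'very-reliable' entities directly, queue 'moderate' ones for manual review, and drop 'low' ones.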

Pattern 3: Webhooks for Real-Time Processing

Don't poll isContentDone—use webhooks:

// Configure webhook when creating feed or workflow
const feed = await graphlit.createFeed(
  FeedServiceTypes.Gmail,
  { id: workflow.createWorkflow.id },
  { readLimit: 100 },
  'Gmail with Webhooks',
  undefined,
  'https://yourapp.com/webhooks/graphlit'  // Your webhook endpoint
);

// Your webhook handler (Express.js example; assumes `app` is an Express app using express.json())
app.post('/webhooks/graphlit', async (req, res) => {
  const event = req.body;
  
  if (event.type === 'content.done') {
    const contentId = event.contentId;
    
    // Retrieve entities
    const content = await graphlit.getContent(contentId);
    const entities = content.content.observations || [];
    
    // Store in your database, trigger notifications, etc.
    await storeEntities(entities);
    
    console.log(`Processed ${entities.length} entities from ${event.contentId}`);
  }
  
  res.sendStatus(200);
});
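Webhook bodies arrive from the network, so it is worth validating them before use. The sketch below assumes the payload shape used in the handler above (a type of 'content.done' plus a contentId); it also assumes the same event could be delivered more than once, which this guide does not confirm. isContentDoneEvent and shouldProcess are hypothetical helper names.

```typescript
interface ContentDoneEvent {
  type: 'content.done';
  contentId: string;
}

// Type guard: accept only payloads shaped like the event handled above.
function isContentDoneEvent(body: unknown): body is ContentDoneEvent {
  if (typeof body !== 'object' || body === null) return false;
  const candidate = body as { type?: unknown; contentId?: unknown };
  return candidate.type === 'content.done'
    && typeof candidate.contentId === 'string'
    && candidate.contentId.length > 0;
}

// In-memory record of handled content IDs.
// A multi-instance deployment would need a shared store instead.
const handledContentIds = new Set<string>();

// Returns true the first time a content ID is seen, false on repeats.
function shouldProcess(event: ContentDoneEvent): boolean {
  if (handledContentIds.has(event.contentId)) return false;
  handledContentIds.add(event.contentId);
  return true;
}

const sampleBody: unknown = { type: 'content.done', contentId: 'content-123' };
if (isContentDoneEvent(sampleBody)) {
  console.log(shouldProcess(sampleBody)); // true
  console.log(shouldProcess(sampleBody)); // false
}
```

In the handler, these two checks would run before getContent, so malformed or repeated deliveries return early without touching the API.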

Pattern 4: Entity Deduplication Strategies

Graphlit deduplicates automatically by name, but you can improve matching:

// Query entity with properties to help deduplication
const entity = await graphlit.getObservable('entity-person-123');

console.log('Entity properties:', {
  name: entity.observable?.name,
  email: entity.observable?.properties?.email,
  affiliation: entity.observable?.properties?.affiliation,
  alternateNames: entity.observable?.alternateNames
});

// Entities with same name + same email = deduplicated automatically
// "Kirk Marple" (kirk@graphlit.com) in PDF + "Kirk Marple" (kirk@graphlit.com) in email = 1 Observable

Part 7: Advanced Querying & Graph Traversal

Query Entities with Related Content

Find all content mentioning a specific person:

// Get all content where "Alice Johnson" is mentioned
const aliceEntity = await graphlit.queryObservables({
  filter: {
    searchText: "Alice Johnson",
    types: [ObservableTypes.Person]
  }
});

const aliceId = aliceEntity.observables?.results?.[0]?.observable.id;
if (!aliceId) {
  throw new Error('No entity found for "Alice Johnson"');
}

// Search content filtered by this entity
const relatedContent = await graphlit.searchContents('', {
  filters: [{
    observations: {
      observables: [{ id: aliceId }]
    }
  }]
});

console.log(`Content mentioning Alice Johnson: ${relatedContent.results?.length}`);
relatedContent.results?.forEach(content => {
  console.log(`- ${content.name}`);
});

Build Entity Timeline (Chronological Mentions)

// Get all observations of an entity, sorted by content date
const entityTimeline = await graphlit.queryObservables({
  observables: [{ id: 'entity-person-123' }]
});

// Fetch each content item to get dates
const timeline = await Promise.all(
  entityTimeline.observables?.results?.[0]?.observable.observations?.map(async obs => {
    const content = await graphlit.getContent(obs.contentId);
    return {
      contentName: content.content.name,
      date: content.content.finishedDate || content.content.creationDate,
      pages: obs.occurrences?.map(occ => occ.pageIndex)
    };
  }) || []
);

// Sort by date
timeline
  .sort((a, b) => new Date(a.date).getTime() - new Date(b.date).getTime())
  .forEach(item => {
    console.log(`${item.date}: ${item.contentName} (pages ${item.pages?.join(', ')})`);
  });

Use case: "Show me the history of mentions for Project Phoenix across all docs, chronologically."

Entity Co-Occurrence Network (Graph Visualization Data)

// Build person-person co-occurrence network
interface PersonRelationship {
  person1: string;
  person2: string;
  strength: number;  // Number of shared pages
}

const network: PersonRelationship[] = [];

const content = await graphlit.getContent('content-123');
const personObservations = content.content.observations
  ?.filter(obs => obs.type === ObservableTypes.Person) || [];

// For each pair of people
for (let i = 0; i < personObservations.length; i++) {
  for (let j = i + 1; j < personObservations.length; j++) {
    const person1 = personObservations[i];
    const person2 = personObservations[j];
    
    const pages1 = new Set(person1.occurrences?.map(occ => occ.pageIndex));
    const pages2 = new Set(person2.occurrences?.map(occ => occ.pageIndex));
    
    const sharedPages = Array.from(pages1).filter(p => pages2.has(p));
    
    if (sharedPages.length > 0) {
      network.push({
        person1: person1.observable.name,
        person2: person2.observable.name,
        strength: sharedPages.length
      });
    }
  }
}

// Export for visualization (D3.js, Cytoscape, etc.)
console.log('Network data for graph visualization:', JSON.stringify(network));

Use case: Visualize who works together based on co-mentions in documents.
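Visualization libraries usually want explicit node and edge lists rather than a flat list of pairs. The sketch below converts pair data like the network array above into the nodes/links shape commonly used with D3 force layouts; Cytoscape expects a different element format, so treat this as one possible target. PersonEdge mirrors the PersonRelationship interface above, and toNodeLinkGraph, GraphNode, and GraphLink are hypothetical names.

```typescript
interface PersonEdge { person1: string; person2: string; strength: number; }
interface GraphNode { id: string; degree: number; }   // degree = number of distinct connections
interface GraphLink { source: string; target: string; value: number; }

// Convert co-occurrence pairs into node and link lists.
function toNodeLinkGraph(edges: PersonEdge[]): { nodes: GraphNode[]; links: GraphLink[] } {
  const degree = new Map<string, number>();
  const links: GraphLink[] = [];

  edges.forEach(edge => {
    degree.set(edge.person1, (degree.get(edge.person1) || 0) + 1);
    degree.set(edge.person2, (degree.get(edge.person2) || 0) + 1);
    links.push({ source: edge.person1, target: edge.person2, value: edge.strength });
  });

  const nodes: GraphNode[] = [];
  degree.forEach((count, id) => nodes.push({ id, degree: count }));
  return { nodes, links };
}

const sampleGraph = toNodeLinkGraph([
  { person1: 'Geoffrey Hinton', person2: 'Yann LeCun', strength: 3 },
  { person1: 'Geoffrey Hinton', person2: 'Yoshua Bengio', strength: 1 }
]);
console.log(JSON.stringify(sampleGraph, null, 2));
```

The degree field gives a simple way to size nodes, and value can drive link thickness.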


Common Issues & Solutions

Issue: Too Many False Positives

Problem: LLM extracts irrelevant entities (e.g., "Monday" as a place).

Solutions:

  1. Filter by confidence score (>= 0.8)
  2. Use more specific entity types (avoid Other)
  3. Post-process with custom filters:
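// Day and month names that are sometimes mis-tagged as places
const calendarWords = new Set([
  'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday',
  'january', 'february', 'march', 'april', 'may', 'june', 'july',
  'august', 'september', 'october', 'november', 'december'
]);

const validEntities = observations.filter(obs => {
  // Exclude places named after a day or month; single-word places like "Paris" are kept
  if (obs.type === ObservableTypes.Place &&
      calendarWords.has(obs.observable.name.trim().toLowerCase())) {
    return false;
  }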
  
  // Exclude generic organization names
  if (obs.type === ObservableTypes.Organization && 
      ['Company', 'Corporation', 'Inc'].includes(obs.observable.name)) {
    return false;
  }
  
  return true;
});

Issue: Entities Not Deduplicating

Problem: "Alice Johnson" and "A. Johnson" appear as separate entities.

Solutions:

  1. Check alternateNames field (Graphlit populates this automatically)
  2. Manual entity merging (not yet supported—coming soon)
  3. Use entity properties to help matching:
// Enrich entities with properties from source data
// Graphlit will deduplicate entities with matching email addresses
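Until merging is available, one workaround is to flag likely duplicates on the client for human review. The heuristic below is an assumption-laden sketch, not how Graphlit matches entities: it treats two person names as candidates when their last tokens match and their first tokens are equal or one is the other's initial. nameTokens, likelySamePerson, and findDuplicateCandidates are hypothetical names.

```typescript
// Split a name into lowercase tokens, treating periods as spaces ("A. Johnson" -> ["a", "johnson"]).
function nameTokens(name: string): string[] {
  return name.toLowerCase().replace(/\./g, ' ').split(/\s+/).filter(token => token.length > 0);
}

// Heuristic match: same last token, and first tokens equal or one is the other's initial.
function likelySamePerson(nameA: string, nameB: string): boolean {
  const a = nameTokens(nameA);
  const b = nameTokens(nameB);
  if (a.length < 2 || b.length < 2) return false;
  if (a[a.length - 1] !== b[b.length - 1]) return false;

  const firstA = a[0];
  const firstB = b[0];
  if (firstA === firstB) return true;

  const isInitialOf = (initial: string, full: string): boolean =>
    initial.length === 1 && full.charAt(0) === initial;
  return isInitialOf(firstA, firstB) || isInitialOf(firstB, firstA);
}

// Compare every pair of names once and collect the candidate duplicates.
function findDuplicateCandidates(names: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      if (likelySamePerson(names[i], names[j])) {
        pairs.push([names[i], names[j]]);
      }
    }
  }
  return pairs;
}

console.log(findDuplicateCandidates(['Alice Johnson', 'A. Johnson', 'Bob Johnson']));
// [ [ 'Alice Johnson', 'A. Johnson' ] ]
```

The heuristic will produce false matches (two different people who share an initial and surname), which is why its output is best treated as a review list rather than an automatic merge.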

Issue: Missing Entities

Problem: Expected entities not extracted.

Solutions:

  1. Verify entity type is in extractedTypes array
  2. Check PDF text extraction quality (scanned PDFs may have OCR errors)
  3. Use ModelDocument for PDFs (not Text)—vision models are more accurate
  4. Lower confidence threshold temporarily to see if entities are extracted with low confidence

What's Next?

You now have everything you need to build production knowledge graphs. Next steps:

  1. Integrate with your app: Use entity IDs to filter search, build entity-driven UIs
  2. Add more content types: Emails, Slack, meetings—unified knowledge graph
  3. Explore relationships: Build co-occurrence networks, guided search
  4. Scale to production: Webhooks, batch processing, monitoring


Complete Example: Production Knowledge Graph

Here's a complete, production-ready example that ties everything together:

import { Graphlit } from 'graphlit-client';
import {
  FilePreparationServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  EntityState
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

async function buildProductionKnowledgeGraph() {
  // 1. Create extraction workflow
  console.log('Creating workflow...');
  const workflow = await graphlit.createWorkflow({
    name: "Production Entity Extraction",
    preparation: {
      jobs: [{
        connector: {
          type: FilePreparationServiceTypes.ModelDocument
        }
      }]
    },
    extraction: {
      jobs: [{
        connector: {
          type: EntityExtractionServiceTypes.ModelDocument,
          extractedTypes: [
            ObservableTypes.Person,
            ObservableTypes.Organization,
            ObservableTypes.Place,
            ObservableTypes.Event,
            ObservableTypes.Product
          ]
        }
      }]
    }
  });

  // 2. Ingest multiple documents
  console.log('Ingesting documents...');
  const documents = [
    'https://company.com/annual-report.pdf',
    'https://company.com/strategy.pdf',
    'https://company.com/team-directory.pdf'
  ];

  const contentIds: string[] = [];

  for (const url of documents) {
    const content = await graphlit.ingestUri(
      url,
      undefined,
      undefined,
      undefined,
      undefined,
      { id: workflow.createWorkflow.id }
    );
    contentIds.push(content.ingestUri.id);
  }

  // 3. Wait for all to complete (polling keeps this example self-contained; prefer webhooks in production)
  console.log('Processing...');
  for (const contentId of contentIds) {
    let isDone = false;
    while (!isDone) {
      const status = await graphlit.isContentDone(contentId);
      isDone = status.isContentDone.result;
      if (!isDone) await new Promise(r => setTimeout(r, 2000));
    }
  }

  // 4. Query unified knowledge graph
  console.log('\n=== Knowledge Graph Statistics ===');
  
  const people = await graphlit.queryObservables({
    filter: { types: [ObservableTypes.Person], states: [EntityState.Enabled] }
  });
  console.log(`Total people: ${people.observables?.results?.length}`);

  const orgs = await graphlit.queryObservables({
    filter: { types: [ObservableTypes.Organization], states: [EntityState.Enabled] }
  });
  console.log(`Total organizations: ${orgs.observables?.results?.length}`);

  const places = await graphlit.queryObservables({
    filter: { types: [ObservableTypes.Place], states: [EntityState.Enabled] }
  });
  console.log(`Total places: ${places.observables?.results?.length}`);

  // 5. Find top entities by mention count
  console.log('\n=== Most Mentioned People ===');
  people.observables?.results
    ?.sort((a, b) => (b.observable.observationCount || 0) - (a.observable.observationCount || 0))
    .slice(0, 10)
    .forEach((result, i) => {
      console.log(`${i + 1}. ${result.observable.name} (${result.observable.observationCount} mentions)`);
    });

  // 6. Build relationship network
  console.log('\n=== Person-Organization Relationships ===');
  
  // Get all observations from all content, tagging each with its source content ID
  const allObservations: any[] = [];
  for (const contentId of contentIds) {
    const content = await graphlit.getContent(contentId);
    for (const obs of content.content.observations || []) {
      allObservations.push({ ...obs, contentId });
    }
  }

  // Find co-occurrences
  const relationships = new Map<string, Set<string>>();

  allObservations
    .filter(obs => obs.type === ObservableTypes.Person)
    .forEach(personObs => {
      const personName = personObs.observable.name;
      
      allObservations
        .filter(obs => obs.type === ObservableTypes.Organization)
        .forEach(orgObs => {
          const orgName = orgObs.observable.name;
          
          // Check if on same pages in same document (keyed by the tagged content ID)
          const personPages = new Set(
            (personObs.occurrences || []).map((occ: any) => `${personObs.contentId}-${occ.pageIndex}`)
          );
          const orgPages = new Set(
            (orgObs.occurrences || []).map((occ: any) => `${orgObs.contentId}-${occ.pageIndex}`)
          );
          
          const overlap = Array.from(personPages).filter(p => orgPages.has(p));
          
          if (overlap.length > 0) {
            if (!relationships.has(personName)) {
              relationships.set(personName, new Set());
            }
            relationships.get(personName)!.add(orgName);
          }
        });
    });

  // Display top relationships
  Array.from(relationships.entries())
    .sort((a, b) => b[1].size - a[1].size)
    .slice(0, 10)
    .forEach(([person, orgs]) => {
      console.log(`${person} ↔ ${Array.from(orgs).join(', ')}`);
    });

  console.log('\n✓ Knowledge graph complete!');
}

buildProductionKnowledgeGraph().catch(console.error);

Run this, and you'll have a production knowledge graph with entity statistics, relationships, and queryable structure—ready to power search, RAG, and AI applications.

Happy graph building! 🚀
