Web Scraping & Search: Crawling, Tavily, Exa

Web crawling and search feeds let you ingest competitor sites, documentation, and search results into Graphlit. Graphlit crawls websites directly and integrates with the Tavily and Exa search APIs.

What You'll Learn

Web crawling configuration
Site mapping and link following
Domain and path filtering
Tavily search integration
Exa search integration
Web-to-markdown conversion

Part 1: Web Crawling

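import { Graphlit } from 'graphlit-client';
import { FeedServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Client used by every example in this guide (assumes credentials are supplied via environment variables)
const graphlit = new Graphlit();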

// Crawl documentation site
const webCrawl = await graphlit.createFeed({
  name: 'Documentation Crawler',
  type: FeedServiceTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedDomains: ['docs.example.com'],  // Stay on domain
    excludedPaths: ['/api/', '/archive/'],  // Skip sections
    depth: 3  // Follow links 3 levels deep
  }
});

What happens:

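Starts at uri
Extracts each page's content and converts it to markdown
Follows links that stay within allowedDomains and outside excludedPaths
Stops when readLimit pages have been read or the next links lie more than depth levels from the start page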
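The steps above can be sketched as a breadth-first traversal. The following is a minimal illustration under stated assumptions, not Graphlit's actual crawler: planCrawl, CrawlOptions, and the in-memory LinkGraph are hypothetical names, and the graph stands in for fetching pages and extracting their links.

```typescript
// Illustrative only: a breadth-first crawl over an in-memory link graph.
// NOT Graphlit's implementation; option names simply mirror the feed config.
interface CrawlOptions {
  uri: string;
  readLimit: number;
  allowedDomains: string[];
  excludedPaths: string[];
  depth: number;
}

// Hypothetical stand-in for "fetch a page and extract its links".
type LinkGraph = Record<string, string[]>;

function planCrawl(graph: LinkGraph, opts: CrawlOptions): string[] {
  const visited = new Set<string>();
  const pages: string[] = [];
  let frontier: string[] = [opts.uri];

  // A link is eligible if its host is allowed and its path is not excluded.
  const allowed = (link: string): boolean => {
    const url = new URL(link);
    return (
      opts.allowedDomains.includes(url.hostname) &&
      !opts.excludedPaths.some((p) => url.pathname.startsWith(p))
    );
  };

  // Level 0 is the start page; each pass follows links one level deeper.
  for (let level = 0; level <= opts.depth && frontier.length > 0; level++) {
    const next: string[] = [];
    for (const page of frontier) {
      if (visited.has(page) || !allowed(page)) continue;
      if (pages.length >= opts.readLimit) return pages; // page budget exhausted
      visited.add(page);
      pages.push(page);
      next.push(...(graph[page] ?? []));
    }
    frontier = next;
  }
  return pages;
}

const site: LinkGraph = {
  'https://docs.example.com/': [
    'https://docs.example.com/guide',
    'https://docs.example.com/api/v1',
    'https://other.example.org/',
  ],
  'https://docs.example.com/guide': ['https://docs.example.com/guide/setup'],
  'https://docs.example.com/guide/setup': ['https://docs.example.com/guide/setup/advanced'],
};

const baseOptions: CrawlOptions = {
  uri: 'https://docs.example.com/',
  readLimit: 500,
  allowedDomains: ['docs.example.com'],
  excludedPaths: ['/api/', '/archive/'],
  depth: 1,
};

// With depth 1, only the start page and its directly linked, eligible pages are read.
const crawled = planCrawl(site, baseOptions);
console.log(crawled);
```

In this sketch the /api/ link and the off-domain link are skipped by the filters, while deeper pages are cut off by depth. Whichever limit is hit first ends the crawl.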

Part 2: Sitemap Crawling

// Crawl from sitemap.xml
const sitemapCrawl = await graphlit.createFeed({
  name: 'Sitemap Crawler',
  type: FeedServiceTypes.Sitemap,
  sitemap: {
    uri: 'https://example.com/sitemap.xml',
    readLimit: 1000
  }
});

Benefits:

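Faster than link following, because page URLs are read from one file instead of being discovered page by page
Covers every page the sitemap lists, including pages that no other page links to
Follows the structure the site itself declares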
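To make the difference from link following concrete, here is a minimal sketch of reading page URLs straight out of sitemap XML and capping them at readLimit. The sitemapUrls helper and the sample XML are hypothetical. The regex handles only flat loc entries and does not decode XML entities, so a real parser should use an XML library.

```typescript
// Illustrative sketch: pull page URLs out of sitemap XML, capped at readLimit.
// Assumption: a flat <urlset> with plain <loc> entries (no entity decoding).
function sitemapUrls(xml: string, readLimit: number): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while (urls.length < readLimit && (match = locPattern.exec(xml)) !== null) {
    const loc = match[1];
    if (loc) urls.push(loc);
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>`;

// Only the first two listed pages are kept because readLimit is 2.
const listed = sitemapUrls(sampleSitemap, 2);
console.log(listed);
```

No page has to be fetched to discover the next one, which is why a sitemap crawl can enumerate its targets up front.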

Part 3: Tavily Search

// Ingest Tavily search results
const tavilyFeed = await graphlit.createFeed({
  name: 'Tavily Search',
  type: FeedServiceTypes.TavilySearch,
  tavilySearch: {
    apiKey: process.env.TAVILY_API_KEY,
    query: 'AI agent frameworks',
    maxResults: 50
  }
});

Use cases:

Market research
Competitive intelligence
Trend analysis

Part 4: Exa Search

// Ingest Exa search results
const exaFeed = await graphlit.createFeed({
  name: 'Exa Search',
  type: FeedServiceTypes.ExaSearch,
  exaSearch: {
    apiKey: process.env.EXA_API_KEY,
    query: 'machine learning papers',
    maxResults: 100
  }
});
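Both search feeds read their API key from process.env, which yields undefined when the variable is not set. One way to catch that before a feed is created is a small guard like the following. This is a sketch, requireKey is a hypothetical helper rather than part of any SDK, and the key values are placeholders.

```typescript
// Hypothetical helper: read a required API key and fail fast with a clear message.
function requireKey(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing environment variable ${name}`);
  }
  return value;
}

// In a Node.js script you would pass process.env, for example:
//   apiKey: requireKey(process.env, 'TAVILY_API_KEY')
// A plain object is used here so the example runs anywhere.
const demoEnv: Record<string, string | undefined> = {
  TAVILY_API_KEY: 'tvly-placeholder', // placeholder value, not a real key
  EXA_API_KEY: '',                    // empty values are treated as missing
};

const tavilyKey = requireKey(demoEnv, 'TAVILY_API_KEY');
console.log(tavilyKey.length > 0);
```

Failing at startup gives a clearer error than a feed that is created successfully but cannot authenticate against the search provider.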

Production Patterns

Competitive Intelligence

// Crawl competitor docs
const competitors = [
  'https://competitor1.com/docs',
  'https://competitor2.com/docs',
  'https://competitor3.com/docs'
];

for (const url of competitors) {
  await graphlit.createFeed({
    name: `Competitor: ${url}`,
    type: FeedServiceTypes.Web,
    web: {
      uri: url,
      readLimit: 200,
      allowedDomains: [new URL(url).hostname]
    }
  });
}

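// Search all ingested content, including the competitor docs
// (feeds ingest asynchronously, so results appear as crawling progresses)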
const features = await graphlit.queryContents({
  search: 'pricing features'
});
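The loop above calls new URL(url) inline, which throws on a malformed entry and stops the whole batch. A possible refinement is to validate and de-duplicate the list before creating any feeds. The sketch below assumes a hypothetical buildCompetitorCrawls helper and WebCrawlSettings shape that only mirror the name and web fields used in this guide.

```typescript
// Hypothetical helper: turn a list of competitor URLs into per-site crawl settings.
interface WebCrawlSettings {
  name: string;
  web: { uri: string; readLimit: number; allowedDomains: string[] };
}

function buildCompetitorCrawls(urls: string[], readLimit = 200): WebCrawlSettings[] {
  const seen = new Set<string>();
  const settings: WebCrawlSettings[] = [];
  for (const raw of urls) {
    let parsed: URL;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // skip malformed entries instead of failing the whole batch
    }
    if (seen.has(parsed.href)) continue; // avoid creating duplicate feeds
    seen.add(parsed.href);
    settings.push({
      name: `Competitor: ${parsed.hostname}`,
      // Restrict each crawl to the competitor's own host.
      web: { uri: parsed.href, readLimit, allowedDomains: [parsed.hostname] },
    });
  }
  return settings;
}

const crawls = buildCompetitorCrawls([
  'https://competitor1.com/docs',
  'not a url',
  'https://competitor1.com/docs',
  'https://competitor2.com/docs',
]);
console.log(crawls.map((c) => c.name));
```

Each resulting entry could then be merged with the feed type and passed to createFeed, one call per competitor.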

Documentation Monitoring

// Monitor docs for changes
const docsCrawl = await graphlit.createFeed({
  name: 'Docs Monitor',
  type: FeedServiceTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500
  },
  schedulePolicy: {
    recurrenceType: 'DAILY',  // Re-crawl daily
    interval: 1
  }
});
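A daily re-crawl is most useful when you can tell what changed between runs. As a rough sketch, two snapshots of crawled pages (URL mapped to markdown) can be compared as shown here. The Snapshot type and diffSnapshots function are hypothetical and sit outside Graphlit, and how you obtain the snapshots is left open.

```typescript
// Illustrative sketch: compare two crawl snapshots (URL -> markdown) to see what changed.
type Snapshot = Record<string, string>;

interface CrawlDiff {
  added: string[];
  removed: string[];
  changed: string[];
}

function diffSnapshots(previous: Snapshot, current: Snapshot): CrawlDiff {
  const diff: CrawlDiff = { added: [], removed: [], changed: [] };
  for (const url of Object.keys(current)) {
    if (!(url in previous)) diff.added.push(url);
    else if (previous[url] !== current[url]) diff.changed.push(url);
  }
  for (const url of Object.keys(previous)) {
    if (!(url in current)) diff.removed.push(url);
  }
  return diff;
}

const yesterday: Snapshot = {
  'https://docs.example.com/': '# Welcome',
  'https://docs.example.com/install': '# Install\nRun the installer.',
  'https://docs.example.com/legacy': '# Legacy',
};

const today: Snapshot = {
  'https://docs.example.com/': '# Welcome',
  'https://docs.example.com/install': '# Install\nRun the new installer.',
  'https://docs.example.com/changelog': '# Changelog',
};

const docsDiff = diffSnapshots(yesterday, today);
console.log(docsDiff);
```

Comparing full markdown strings is the simplest approach. For large sites, storing a hash per page instead of the full text would keep the snapshots small.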

Related Guides

Data Connectors - All connector types
Content Ingestion - Ingestion basics