
Web Scraping & Search: Crawling, Tavily, Exa

Master web content extraction with Graphlit. Learn web crawling patterns, site mapping, search API integration (Tavily, Exa), and web-to-markdown conversion.

Web scraping and search APIs let you ingest competitor sites, documentation, and search results. Graphlit crawls websites and integrates with Tavily and Exa search APIs.

What You'll Learn

  • Web crawling configuration
  • Site mapping and link following
  • Domain and path filtering
  • Tavily search integration
  • Exa search integration
  • Web-to-markdown conversion

Part 1: Web Crawling

import { FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

// Crawl documentation site
const webCrawl = await graphlit.createFeed({
  name: 'Documentation Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedPaths: ['^/docs/.*'],      // Regex: only crawl /docs/* pages
    excludedPaths: ['^/api/.*', '^/archive/.*']  // Regex: skip these sections
  }
});

What happens:

  1. Starts at uri
  2. Extracts page content → markdown
  3. Follows links matching allowedPaths patterns
  4. Continues until readLimit pages reached
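The allowedPaths and excludedPaths values are regular expressions tested against discovered links. A minimal sketch of that filtering logic (assuming, as the ^/docs/ anchors suggest, that the patterns are matched against the URL path):

```typescript
// Sketch of allowedPaths/excludedPaths semantics: a link is crawled only
// if its path matches an allowed pattern and no excluded pattern.
const allowedPaths = [/^\/docs\/.*/];
const excludedPaths = [/^\/api\/.*/, /^\/archive\/.*/];

function shouldCrawl(url: string): boolean {
  const path = new URL(url).pathname;
  const allowed = allowedPaths.some((re) => re.test(path));
  const excluded = excludedPaths.some((re) => re.test(path));
  return allowed && !excluded;
}

console.log(shouldCrawl('https://docs.example.com/docs/intro'));    // true
console.log(shouldCrawl('https://docs.example.com/api/reference')); // false
console.log(shouldCrawl('https://docs.example.com/blog/post'));     // false
```

Note that an empty allowedPaths list would reject everything in this sketch; in practice, omit the property entirely to crawl without path restrictions.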

Part 2: Sitemap Crawling

// Crawl from sitemap.xml using Web feed with includeFiles
const sitemapCrawl = await graphlit.createFeed({
  name: 'Sitemap Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://example.com/sitemap.xml',
    includeFiles: true,  // Include files referenced in sitemap
    readLimit: 1000
  }
});

Benefits:

  • Faster than recursive link following — no page-by-page discovery
  • Complete coverage of every page the sitemap lists
  • Follows the site owner's published page structure
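A sitemap is just an XML list of <loc> URLs, which is why no link discovery is needed. A simplified sketch of what a sitemap-driven crawl consumes (regex extraction is for illustration only; real tooling would use an XML parser):

```typescript
// Sketch: a sitemap enumerates every page (and file) up front,
// so the crawler reads <loc> entries instead of following links.
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/files/whitepaper.pdf</loc></url>
</urlset>`;

function extractLocs(xml: string): string[] {
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
}

console.log(extractLocs(sitemapXml).length); // 3
```

The PDF entry above is the kind of non-HTML asset that includeFiles: true pulls in.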

Part 3: Tavily Search

import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Ingest Tavily search results
const tavilyFeed = await graphlit.createFeed({
  name: 'Tavily Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Tavily,
    text: 'AI agent frameworks',
    readLimit: 50
  }
});

Use cases:

  • Market research
  • Competitive intelligence
  • Trend analysis

Part 4: Exa Search

import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Ingest Exa search results
const exaFeed = await graphlit.createFeed({
  name: 'Exa Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Exa,
    text: 'machine learning papers',
    readLimit: 100
  }
});

Alternative: Direct search without feed

// One-time search (not continuous monitoring)
const results = await graphlit.searchWeb(
  'machine learning papers',
  SearchServiceTypes.Exa,
  100  // limit
);
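Feeds ingest asynchronously, so a common pattern is to poll until ingestion finishes before querying content. A minimal, API-agnostic polling helper (the commented usage assumes the client exposes a feed-status call such as isFeedDone; check your client version for the exact method and response shape):

```typescript
// Generic polling helper for long-running ingestion. `probe` wraps
// whatever status check your client provides; the helper itself makes
// no assumptions about the Graphlit API.
async function waitUntilDone(
  probe: () => Promise<boolean>,
  intervalMs = 5000,
  maxAttempts = 60
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    if (await probe()) return true;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // timed out
}

// Usage sketch (hypothetical status call — verify against your client):
// const done = await waitUntilDone(async () =>
//   (await graphlit.isFeedDone(exaFeed.createFeed.id)).isFeedDone?.result ?? false
// );
```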

Production Patterns

Competitive Intelligence

// Crawl competitor docs
const competitors = [
  'https://competitor1.com/docs',
  'https://competitor2.com/docs',
  'https://competitor3.com/docs'
];

for (const url of competitors) {
  await graphlit.createFeed({
    name: `Competitor: ${url}`,
    type: FeedTypes.Web,
    web: {
      uri: url,
      readLimit: 200,
      allowedPaths: ['^/docs/.*']  // Stay within docs section
    }
  });
}

// Search across all competitors
const features = await graphlit.queryContents({
  filter: {
    search: 'pricing features'
  }
});
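If the same site appears twice in the list, the loop above would create duplicate feeds. A small helper sketch that derives a readable feed name from the hostname and dedupes by site (hostname-based naming is a convention here, not a Graphlit requirement):

```typescript
// Build one feed config per distinct competitor site, keyed by hostname.
function competitorFeedConfigs(urls: string[]): { name: string; uri: string }[] {
  const seen = new Set<string>();
  const configs: { name: string; uri: string }[] = [];
  for (const url of urls) {
    const host = new URL(url).hostname;
    if (seen.has(host)) continue; // skip duplicate sites
    seen.add(host);
    configs.push({ name: `Competitor: ${host}`, uri: url });
  }
  return configs;
}

console.log(competitorFeedConfigs([
  'https://competitor1.com/docs',
  'https://competitor1.com/docs', // duplicate, dropped
  'https://competitor2.com/docs'
]).length); // 2
```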

Documentation Monitoring

import { TimedPolicyRecurrenceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Monitor docs for changes
const docsCrawl = await graphlit.createFeed({
  name: 'Docs Monitor',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500
  },
  schedulePolicy: {
    recurrenceType: TimedPolicyRecurrenceTypes.Repeat,
    repeatInterval: 'P1D'  // ISO 8601 duration: 1 day
  }
});
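repeatInterval takes an ISO 8601 duration string. A small helper for composing the common day/hour/minute forms (purely illustrative — any valid ISO 8601 duration works):

```typescript
// Compose ISO 8601 durations like the 'P1D' above. Time designators
// (hours, minutes) must follow a 'T' separator per the standard.
function isoDuration(parts: { days?: number; hours?: number; minutes?: number }): string {
  const { days = 0, hours = 0, minutes = 0 } = parts;
  let out = 'P';
  if (days) out += `${days}D`;
  if (hours || minutes) {
    out += 'T';
    if (hours) out += `${hours}H`;
    if (minutes) out += `${minutes}M`;
  }
  return out === 'P' ? 'PT0S' : out; // zero duration fallback
}

console.log(isoDuration({ days: 1 }));            // "P1D"  - daily
console.log(isoDuration({ hours: 6 }));           // "PT6H" - every 6 hours
console.log(isoDuration({ days: 1, hours: 12 })); // "P1DT12H"
```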


Ready to Build with Graphlit?

Start building AI-powered applications with our API-first platform. Free tier includes 100 credits/month — no credit card required.

