
Web Scraping & Search: Crawling, Tavily, Exa

Master web content extraction with Graphlit. Learn web crawling patterns, site mapping, search API integration (Tavily, Exa), and web-to-markdown conversion.

Web scraping and search APIs let you ingest competitor sites, documentation, and search results. Graphlit crawls websites through web feeds, monitors RSS feeds, and ingests results from the Tavily and Exa search APIs.

What You'll Learn

  • Web crawling configuration
  • Site mapping and link following
  • Domain and path filtering
  • Tavily search integration
  • Exa search integration
  • Web-to-markdown conversion

Part 1: Web Crawling

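import { Graphlit } from 'graphlit-client';
import { FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

// Client used by every example in this guide; reads credentials from environment variables
const graphlit = new Graphlit();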

// Crawl documentation site
const webCrawl = await graphlit.createFeed({
  name: 'Documentation Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedPaths: ['^/docs/.*'],      // Regex: only crawl /docs/* pages
    excludedPaths: ['^/api/.*', '^/archive/.*']  // Regex: skip these sections
  }
});

What happens:

  1. Starts at uri
  2. Extracts each page's content and converts it to markdown
  3. Follows links that match an allowedPaths pattern and match no excludedPaths pattern
  4. Continues until readLimit pages have been read
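
The link-filtering step above can be sketched in plain TypeScript. This is only an illustration of how allow and exclude regex lists typically interact, not Graphlit's actual implementation: the shouldCrawl helper is hypothetical, and it assumes patterns are tested against the URL path rather than the full URL.

```typescript
// Hypothetical helper (not part of graphlit-client): decides whether a
// crawler should follow a link, given allow/exclude regex lists.
// Assumption: patterns are matched against the URL path only.
function shouldCrawl(
  url: string,
  allowedPaths: string[],
  excludedPaths: string[]
): boolean {
  const path = new URL(url).pathname;

  // Exclusions win: any matching excluded pattern rejects the page.
  if (excludedPaths.some((pattern) => new RegExp(pattern).test(path))) {
    return false;
  }

  // With no allow-list, everything that is not excluded is crawled.
  if (allowedPaths.length === 0) {
    return true;
  }

  return allowedPaths.some((pattern) => new RegExp(pattern).test(path));
}

const allowed = ['^/docs/.*'];
const excluded = ['^/api/.*', '^/archive/.*'];

console.log(shouldCrawl('https://docs.example.com/docs/intro', allowed, excluded)); // true
console.log(shouldCrawl('https://docs.example.com/api/v1', allowed, excluded));     // false
```

Note that '^/docs/.*' requires the trailing slash, so the bare path /docs does not match it. Testing patterns this way before creating a feed can catch filters that are stricter than intended.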

Part 2: Sitemap Crawling

// Crawl from sitemap.xml using Web feed with includeFiles
const sitemapCrawl = await graphlit.createFeed({
  name: 'Sitemap Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://example.com/sitemap.xml',
    includeFiles: true,  // Include files referenced in sitemap
    readLimit: 1000
  }
});

Benefits:

  • Faster than link following, because page URLs are listed up front
  • Covers every page the sitemap lists, up to readLimit
  • Follows the structure the site itself publishes
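
To see why a sitemap avoids link following, it helps to look at what one contains: a flat list of page URLs inside <loc> elements. The sketch below pulls those URLs out of a small sample document. The extractSitemapUrls helper and the sample XML are hypothetical, and the regex approach is a simplification; it says nothing about how Graphlit parses sitemaps internally.

```typescript
// Extracts page URLs from a sitemap document using a simple regex.
// A sketch only: a production crawler would use a real XML parser and
// also handle sitemap index files and entity-escaped URLs.
function extractSitemapUrls(xml: string, limit: number): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;

  let match: RegExpExecArray | null;
  // Stop at the limit, mirroring how readLimit caps a crawl.
  while (urls.length < limit && (match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Hypothetical sample sitemap for illustration.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap, 2));
// [ 'https://example.com/', 'https://example.com/pricing' ]
```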

Part 3: Tavily Search

import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Ingest Tavily search results
const tavilyFeed = await graphlit.createFeed({
  name: 'Tavily Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Tavily,
    text: 'AI agent frameworks',
    readLimit: 50
  }
});

Use cases:

  • Market research
  • Competitive intelligence
  • Trend analysis

Part 4: Exa Search

import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Ingest Exa search results
const exaFeed = await graphlit.createFeed({
  name: 'Exa Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Exa,
    text: 'machine learning papers',
    readLimit: 100
  }
});

Alternative: Direct search without feed

// One-time search (not continuous monitoring)
const results = await graphlit.searchWeb(
  'machine learning papers',
  SearchServiceTypes.Exa,
  100  // limit
);

Production Patterns

Competitive Intelligence

import { FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

// Crawl competitor docs
const competitors = [
  'https://competitor1.com/docs',
  'https://competitor2.com/docs',
  'https://competitor3.com/docs'
];

for (const url of competitors) {
  await graphlit.createFeed({
    name: `Competitor: ${url}`,
    type: FeedTypes.Web,
    web: {
      uri: url,
      readLimit: 200,
      allowedPaths: ['^/docs/.*']  // Stay within docs section
    }
  });
}

// Search all ingested content, including the competitor docs once the crawls finish
const features = await graphlit.queryContents({
  filter: {
    search: 'pricing features'
  }
});
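
The loop above uses the full URL as the feed name and hard-codes the path filter. A minimal sketch of deriving both from the start URL follows. The competitorFeedOptions and escapeRegex helpers are hypothetical and not part of graphlit-client; the returned object only mirrors the name and web fields shown above, and the caller would still add type: FeedTypes.Web before passing it to createFeed.

```typescript
// Escapes regex metacharacters so a literal path can be embedded in a pattern.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: derives a feed name and a path filter from a start URL.
function competitorFeedOptions(startUrl: string, readLimit = 200) {
  const { hostname, pathname } = new URL(startUrl);

  // Drop any trailing slash so '/docs' and '/docs/' produce the same pattern.
  const basePath = pathname.replace(/\/+$/, '');

  return {
    name: `Competitor: ${hostname}`,
    web: {
      uri: startUrl,
      readLimit,
      // Matches the base path itself and anything beneath it.
      allowedPaths: [`^${escapeRegex(basePath)}(/.*)?$`],
    },
  };
}

const options = competitorFeedOptions('https://competitor1.com/docs');
console.log(options.name);                // Competitor: competitor1.com
console.log(options.web.allowedPaths[0]); // ^/docs(/.*)?$
```

Anchoring the pattern on a path boundary keeps the crawl inside /docs without also matching sibling paths such as /docsify.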

Documentation Monitoring

import { FeedTypes, TimedPolicyRecurrenceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Monitor docs for changes
const docsCrawl = await graphlit.createFeed({
  name: 'Docs Monitor',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500
  },
  schedulePolicy: {
    recurrenceType: TimedPolicyRecurrenceTypes.Repeat,
    repeatInterval: 'P1D'  // ISO 8601 duration: 1 day
  }
});
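
The repeatInterval value is an ISO 8601 duration, so 'P1D' means one day and 'PT6H' would mean six hours. As a sanity check before creating a feed, a small parser can confirm what an interval string works out to. The durationToMs helper below is hypothetical and not part of graphlit-client; it handles only days, hours, minutes, and seconds.

```typescript
// Converts a simple ISO 8601 duration (days, hours, minutes, seconds only)
// into milliseconds. Weeks, months, and years are deliberately unsupported
// in this sketch because months and years have no fixed length.
function durationToMs(duration: string): number {
  const match = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/.exec(duration);

  // Reject strings that do not parse or that carry no component at all ('P', 'PT').
  if (!match || match.slice(1).every((group) => group === undefined)) {
    throw new Error(`Unsupported duration: ${duration}`);
  }

  const [, days, hours, minutes, seconds] = match;
  const totalSeconds =
    Number(days || 0) * 86400 +
    Number(hours || 0) * 3600 +
    Number(minutes || 0) * 60 +
    Number(seconds || 0);

  return totalSeconds * 1000;
}

console.log(durationToMs('P1D'));  // 86400000
console.log(durationToMs('PT6H')); // 21600000
```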
