Web scraping and search APIs let you ingest competitor sites, documentation, and search results. Graphlit crawls websites and integrates with the Tavily and Exa search APIs.
What You'll Learn
- Web crawling configuration
- Site mapping and link following
- Domain and path filtering
- Tavily search integration
- Exa search integration
- Web-to-markdown conversion
Part 1: Web Crawling
import { FeedTypes } from 'graphlit-client/dist/generated/graphql-types';
// Crawl documentation site
const webCrawl = await graphlit.createFeed({
  name: 'Documentation Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500,
    allowedPaths: ['^/docs/.*'], // Regex: only crawl /docs/* pages
    excludedPaths: ['^/api/.*', '^/archive/.*'] // Regex: skip these sections
  }
});
What happens:
- Starts at uri
- Extracts page content → markdown
- Follows links matching allowedPaths patterns
- Continues until readLimit pages reached
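Once the crawl finishes, the pages are available as ordinary content. Here is a minimal sketch of listing what the feed ingested, assuming the content filter accepts a feeds reference and that the feed id comes back on the createFeed response; adjust the field access to your generated types.
// List content ingested by this crawl (assumed `feeds` filter on queryContents)
const feedId = webCrawl.createFeed?.id;

const crawled = await graphlit.queryContents({
  filter: {
    feeds: [{ id: feedId! }]
  }
});

console.log(`Crawled ${crawled.contents?.results?.length ?? 0} pages so far`);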
Part 2: Sitemap Crawling
// Crawl from sitemap.xml using Web feed with includeFiles
const sitemapCrawl = await graphlit.createFeed({
  name: 'Sitemap Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://example.com/sitemap.xml',
    includeFiles: true, // Include files referenced in sitemap
    readLimit: 1000
  }
});
Benefits:
- Faster than following links page by page
- Covers every page listed in the sitemap
- Respects the site's declared structure
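Large sitemaps can still take a while to process. A polling sketch, assuming the client exposes an isFeedDone query (the exact return shape may differ in your generated types):
// Poll until the sitemap crawl completes (assumed isFeedDone helper)
const sitemapFeedId = sitemapCrawl.createFeed?.id;

if (sitemapFeedId) {
  let done = false;
  while (!done) {
    const status = await graphlit.isFeedDone(sitemapFeedId);
    done = status.isFeedDone?.result === true;
    if (!done) {
      await new Promise((resolve) => setTimeout(resolve, 10000)); // wait 10s between checks
    }
  }
  console.log('Sitemap crawl complete');
}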
Part 3: Tavily Search
import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';
// Ingest Tavily search results
const tavilyFeed = await graphlit.createFeed({
  name: 'Tavily Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Tavily,
    text: 'AI agent frameworks',
    readLimit: 50
  }
});
Use cases:
- Market research
- Competitive intelligence
- Trend analysis
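Search results arrive as ordinary content, so they can be queried like any other feed. A minimal sketch, assuming the same feeds filter shown in Part 1:
// Query only the content ingested by the Tavily feed (assumed `feeds` filter)
const tavilyResults = await graphlit.queryContents({
  filter: {
    search: 'agent orchestration',
    feeds: [{ id: tavilyFeed.createFeed!.id }]
  }
});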
Part 4: Exa Search
import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';
// Ingest Exa search results
const exaFeed = await graphlit.createFeed({
  name: 'Exa Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Exa,
    text: 'machine learning papers',
    readLimit: 100
  }
});
Alternative: Direct search without a feed
// One-time search (not continuous monitoring)
const results = await graphlit.searchWeb(
  'machine learning papers',
  SearchServiceTypes.Exa,
  100 // limit
);
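A direct search returns result metadata rather than ingested content. The sketch below ingests selected hits, assuming the response exposes a results array with uri fields and that ingestUri is available on the client:
// Ingest the top hits as content (assumed response shape and ingestUri helper)
for (const result of results.searchWeb?.results?.slice(0, 10) ?? []) {
  if (result?.uri) {
    await graphlit.ingestUri(result.uri);
  }
}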
Production Patterns
Competitive Intelligence
// Crawl competitor docs
const competitors = [
  'https://competitor1.com/docs',
  'https://competitor2.com/docs',
  'https://competitor3.com/docs'
];

for (const url of competitors) {
  await graphlit.createFeed({
    name: `Competitor: ${url}`,
    type: FeedTypes.Web,
    web: {
      uri: url,
      readLimit: 200,
      allowedPaths: ['^/docs/.*'] // Stay within docs section
    }
  });
}
// Search across all competitors
const features = await graphlit.queryContents({
  filter: {
    search: 'pricing features'
  }
});
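With competitor docs indexed, you can also ask questions across them. A minimal RAG sketch, assuming promptConversation accepts a plain text prompt and returns the assistant message on the response (field access may differ in your generated types):
// Ask a question grounded in the crawled competitor docs (assumed promptConversation helper)
const answer = await graphlit.promptConversation(
  'How do these competitors position their pricing tiers?'
);

console.log(answer.promptConversation?.message?.message);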
Documentation Monitoring
import { TimedPolicyRecurrenceTypes } from 'graphlit-client/dist/generated/graphql-types';
// Monitor docs for changes
const docsCrawl = await graphlit.createFeed({
  name: 'Docs Monitor',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500
  },
  schedulePolicy: {
    recurrenceType: TimedPolicyRecurrenceTypes.Repeat,
    repeatInterval: 'P1D' // ISO 8601 duration: 1 day
  }
});
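When monitoring is no longer needed, the recurring feed can be paused or removed. A sketch, assuming disableFeed and deleteFeed mutations are exposed by the client:
// Pause the recurring crawl without losing already-ingested content (assumed API)
await graphlit.disableFeed(docsCrawl.createFeed!.id);

// Or remove the feed entirely
// await graphlit.deleteFeed(docsCrawl.createFeed!.id);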
Related Guides
- Data Connectors - All connector types
- Content Ingestion - Ingestion basics