Web scraping and search APIs let you ingest competitor sites, documentation, and search results. Graphlit crawls websites through the website integration, monitors RSS feeds, and integrates with Tavily and Exa search APIs.
What You'll Learn
- Web crawling configuration
- Site mapping and link following
- Domain and path filtering
- Tavily search integration
- Exa search integration
- Web-to-markdown conversion
Part 1: Web Crawling
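import { Graphlit } from 'graphlit-client';
import { FeedTypes } from 'graphlit-client/dist/generated/graphql-types';

// Client used by every example below; assumes credentials are supplied via environment variables
const graphlit = new Graphlit();

// Crawl documentation site
const webCrawl = await graphlit.createFeed({
  name: 'Documentation Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500, // Stop after 500 pages
    allowedPaths: ['^/docs/.*'], // Regex: only crawl /docs/* pages
    excludedPaths: ['^/api/.*', '^/archive/.*'] // Regex: skip these sections
  }
});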
What happens:
- Starts at uri
- Extracts page content → markdown
- Follows links matching allowedPaths patterns
- Continues until readLimit pages reached
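Path patterns are easy to get subtly wrong, so it can help to test them locally before creating a feed. The sketch below is a hypothetical helper (wouldCrawl is not part of the Graphlit SDK). It assumes patterns are JavaScript-compatible regexes tested against the URL path and that exclusions take precedence; the service's actual matching rules may differ.

```typescript
// Hypothetical helper: not part of the Graphlit SDK.
// Assumes each pattern is a regex tested against the URL path only.
function wouldCrawl(
  url: string,
  allowedPaths: string[],
  excludedPaths: string[] = []
): boolean {
  const path = new URL(url).pathname;
  const matches = (patterns: string[]): boolean =>
    patterns.some((p) => new RegExp(p).test(path));

  // In this sketch, exclusions win over inclusions.
  if (matches(excludedPaths)) return false;

  // An empty allow-list is treated as "allow everything".
  return allowedPaths.length === 0 || matches(allowedPaths);
}

const allowed = ['^/docs/.*'];
const excluded = ['^/api/.*', '^/archive/.*'];

console.log(wouldCrawl('https://docs.example.com/docs/intro', allowed, excluded)); // true
console.log(wouldCrawl('https://docs.example.com/api/v1', allowed, excluded)); // false
console.log(wouldCrawl('https://docs.example.com/docs', allowed, excluded)); // false
```

The last case is worth noticing: under these assumptions, '^/docs/.*' requires a slash after /docs, so the bare /docs page does not match. A pattern such as '^/docs(/.*)?$' would cover both forms.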
Part 2: Sitemap Crawling
// Crawl from sitemap.xml using Web feed with includeFiles
const sitemapCrawl = await graphlit.createFeed({
  name: 'Sitemap Crawler',
  type: FeedTypes.Web,
  web: {
    uri: 'https://example.com/sitemap.xml',
    includeFiles: true, // Include files referenced in sitemap
    readLimit: 1000
  }
});
Benefits:
- Faster than link following
- Gets all pages from sitemap
- Respects site structure
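To choose a sensible readLimit, it can be useful to know roughly how many pages a sitemap lists. The following is a minimal sketch, not part of the Graphlit SDK: extractSitemapUrls is a hypothetical helper that pulls loc entries out of sitemap XML with a regular expression. It does not handle sitemap index files, CDATA sections, or XML entities, so treat the count as an estimate.

```typescript
// Hypothetical helper for sizing readLimit; not part of the Graphlit SDK.
// Uses a regex rather than a full XML parser, so it only covers simple sitemaps.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Inline sample so the sketch runs without network access.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>`;

const sitemapUrls = extractSitemapUrls(sampleSitemap);
console.log(sitemapUrls.length); // 3
```

If the count is well above the readLimit you planned, either raise the limit or expect the crawl to stop before covering every listed page.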
Part 3: Tavily Search
import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Ingest Tavily search results
const tavilyFeed = await graphlit.createFeed({
  name: 'Tavily Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Tavily,
    text: 'AI agent frameworks',
    readLimit: 50
  }
});
Use cases:
- Market research
- Competitive intelligence
- Trend analysis
Part 4: Exa Search
import { FeedTypes, SearchServiceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Ingest Exa search results
const exaFeed = await graphlit.createFeed({
  name: 'Exa Search',
  type: FeedTypes.Search,
  search: {
    type: SearchServiceTypes.Exa,
    text: 'machine learning papers',
    readLimit: 100
  }
});
Alternative: Direct search without feed
// One-time search (not continuous monitoring)
const results = await graphlit.searchWeb(
  'machine learning papers',
  SearchServiceTypes.Exa,
  100 // limit
);
Production Patterns
Competitive Intelligence
// Crawl competitor docs
const competitors = [
  'https://competitor1.com/docs',
  'https://competitor2.com/docs',
  'https://competitor3.com/docs'
];

for (const url of competitors) {
  await graphlit.createFeed({
    name: `Competitor: ${url}`,
    type: FeedTypes.Web,
    web: {
      uri: url,
      readLimit: 200,
      allowedPaths: ['^/docs/.*'] // Stay within docs section
    }
  });
}

// Search across all competitors
const features = await graphlit.queryContents({
  filter: {
    search: 'pricing features'
  }
});
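The loop above hard-codes '^/docs/.*', which only works when every competitor keeps its documentation under /docs. As a sketch of one way to generalize it, the hypothetical helper below (competitorFeed and WebFeedSketch are illustrative names, not Graphlit SDK types) derives the feed name from the hostname and the path filter from the starting URL. It assumes allowedPaths is matched against the URL path, as in the examples above.

```typescript
// Hypothetical shape mirroring the name/web fields used in this guide.
interface WebFeedSketch {
  name: string;
  web: { uri: string; readLimit: number; allowedPaths: string[] };
}

// Escape regex metacharacters so a literal path can be embedded in a pattern.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds feed settings that stay under the start URL's path.
function competitorFeed(url: string, readLimit = 200): WebFeedSketch {
  const { hostname, pathname } = new URL(url);
  const base = pathname.replace(/\/+$/, ''); // drop trailing slashes
  return {
    name: `Competitor: ${hostname}`,
    web: {
      uri: url,
      readLimit,
      allowedPaths: [`^${escapeRegex(base)}/.*`],
    },
  };
}

const feed = competitorFeed('https://competitor1.com/docs');
console.log(feed.name); // Competitor: competitor1.com
console.log(feed.web.allowedPaths[0]); // ^/docs/.*
```

The result could then be spread into the feed input, for example createFeed({ ...competitorFeed(url), type: FeedTypes.Web }), keeping the per-competitor logic in one place.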
Documentation Monitoring
import { FeedTypes, TimedPolicyRecurrenceTypes } from 'graphlit-client/dist/generated/graphql-types';

// Monitor docs for changes
const docsCrawl = await graphlit.createFeed({
  name: 'Docs Monitor',
  type: FeedTypes.Web,
  web: {
    uri: 'https://docs.example.com',
    readLimit: 500
  },
  schedulePolicy: {
    recurrenceType: TimedPolicyRecurrenceTypes.Repeat, // Re-run the crawl on a schedule
    repeatInterval: 'P1D' // ISO 8601 duration: 1 day
  }
});
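ISO 8601 durations such as 'P1D' are compact but easy to mistype. The sketch below shows how such a string maps to a concrete interval. durationToMs is a hypothetical helper, not part of the Graphlit SDK, and it handles only the day, hour, minute, and second components (no weeks, months, or years). Which durations the service itself accepts is not covered here.

```typescript
// Hypothetical helper; not part of the Graphlit SDK.
// Supports only the PnDTnHnMnS subset of ISO 8601 durations.
function durationToMs(iso: string): number {
  const m = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/.exec(iso);
  // Reject non-matches and empty forms such as "P" or "P1DT".
  if (!m || iso === 'P' || iso.endsWith('T')) {
    throw new Error(`Unsupported ISO 8601 duration: ${iso}`);
  }
  // Missing components default to zero.
  const [, d = '0', h = '0', min = '0', s = '0'] = m;
  const totalMinutes = (Number(d) * 24 + Number(h)) * 60 + Number(min);
  return totalMinutes * 60000 + Number(s) * 1000;
}

console.log(durationToMs('P1D')); // 86400000 (1 day)
console.log(durationToMs('PT6H')); // 21600000 (6 hours)
console.log(durationToMs('PT30M')); // 1800000 (30 minutes)
```

Note the T separator: 'P1M' would mean one month in ISO 8601, while 'PT1M' means one minute. This sketch rejects 'P1M' rather than guessing.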
Related Guides
- Website Integration - Crawl and ingest web pages
- RSS Feed Integration - Monitor blogs and feeds
- Data Connectors - All connector types
- Content Ingestion - Ingestion basics