
Incident Response: Find Root Cause in 2 Minutes, Not 30

When production breaks, every minute counts. Search Slack incidents, GitHub changes, and Sentry errors in one query—not five tools.

3:47 AM. Your phone buzzes. Checkout API is down.

You're on-call. You need to:

  1. Check error logs (Sentry)
  2. Search Slack #incidents for similar issues
  3. Check GitHub for recent deployments
  4. Search Notion for runbooks
  5. Ask teammates if they remember this error

By the time you've gathered context, it's 4:15 AM—28 minutes wasted before you even start fixing.

Zine transforms incident response: Search once, get everything—error logs (via Sentry MCP), Slack incident history, recent GitHub changes, and runbooks. Root cause identified in 2 minutes.

This guide shows DevOps, SREs, and on-call engineers how to set up unified incident context that saves hours during critical moments.


Table of Contents

  1. The Incident Response Time Sink
  2. The Zine Incident Response Workflow
  3. Setup: One-Time Configuration
  4. During an Incident: The 2-Minute Context Gather
  5. Connecting Sentry MCP for Error Logs
  6. Post-Incident: Automated Documentation
  7. Real Incident Examples
  8. Best Practices

The Incident Response Time Sink

Traditional Incident Response Flow

Step 1: Identify the problem (2 minutes)

  • Alert comes in: "Checkout API 500 errors spiking"
  • Check monitoring dashboard (Datadog, New Relic, etc.)

Step 2: Check error logs (5-10 minutes)

  • Open Sentry or CloudWatch
  • Filter by service: checkout-api
  • Filter by timeframe: Last hour
  • Read stack traces, identify error patterns

Step 3: Search Slack #incidents (5-10 minutes)

  • "Has this happened before?"
  • Manually scroll through channel
  • Find similar incident from 3 months ago
  • Read 50-message thread to find resolution

Step 4: Check recent deployments (5-10 minutes)

  • Open GitHub
  • Check recent PRs merged to production
  • Read PR descriptions, commits
  • Identify suspicious changes

Step 5: Search runbooks (3-5 minutes)

  • Open Notion or Confluence
  • Search "checkout troubleshooting"
  • Find (hopefully) relevant runbook

Step 6: Ask teammates (10-15 minutes)

  • Slack: "Anyone seen checkout errors before?"
  • Wait for responses
  • Senior engineer shares context from memory

Total time: 30-45 minutes of context gathering before you start fixing.

During that time: Users can't checkout. Revenue lost. Stress accumulates.


The Zine Incident Response Workflow

Unified Incident Context in One Query

3:47 AM Alert: Checkout API errors

3:48 AM - Open Zine, one query:

Checkout API errors OR timeouts

3:49 AM - Zine returns (in 15 seconds):

  1. Sentry (via MCP): 87 errors in last hour, stack trace shows Redis timeout
  2. Slack #incidents (2 months ago): Same error, resolution documented (increase Redis timeout)
  3. GitHub PR #567: Merged yesterday, "Optimize Redis cache" (modified timeout config)
  4. Slack #engineering (yesterday): "Concerns about new Redis timeout settings" (Alice raised this)
  5. Notion runbook: "Redis Troubleshooting" (timeout adjustment procedure)
  6. GitHub PR #601 (2 months ago): Past fix for same issue

3:49 AM - Hypothesis identified:

  • The recent Redis config change (PR #567) made the timeout too aggressive
  • The same issue occurred 2 months ago (PR #601 fixed it then)
  • Alice warned about this yesterday in Slack

3:50 AM - Fix:

  • Revert Redis timeout config
  • OR: Increase timeout based on runbook
  • Deploy fix

Total time: 3 minutes from alert to fix deployment.

Time saved: 27-42 minutes.


Setup: One-Time Configuration

Step 1: Connect Core Tools to Zine

Required:

  1. Slack: Connect #incidents, #engineering, #devops channels
  2. GitHub: Connect repos (especially backend services)
  3. Notion: Connect runbooks, architecture docs

Recommended:

  4. Meeting recordings: Past architecture/incident review meetings
  5. Email: Vendor discussions, escalation threads

Initial sync: 1-3 hours (one-time)

Step 2: Connect Sentry MCP (Optional but Powerful)

Sentry offers an MCP server for error tracking.

Add the Sentry MCP server to Zine (Zine acts as the MCP client):

  1. Zine Settings → MCP Servers → Add Server
  2. Select "Sentry MCP"
  3. Enter Sentry API key
  4. Authorize

Now Zine can query Sentry errors in addition to searching Slack/GitHub.

Benefit: One query gets error logs + team discussions + code changes.

Step 3: Create Saved Views for Incidents

In Zine, create saved views:

"Recent Incidents":

source:slack channel:#incidents after:30d

"Production Changes":

source:github label:production merged after:7d

"Critical Bugs":

source:github label:critical state:open

Time saved: One-click access during incidents.

Step 4: Set Up Alert (Optional)

Create an alert for proactive monitoring:

Alert Name: "Incident Monitor"
Query:

Slack #incidents new threads
OR GitHub issues labeled 'production' OR 'outage'
OR Sentry errors increased by 50%+
from the last hour

Schedule: Hourly
Delivery: Slack DM

Benefit: Know about incidents immediately, even if you're not actively monitoring.


During an Incident: The 2-Minute Context Gather

Query Templates for Common Incidents

API Errors:

[service-name] API errors OR timeouts OR 500

Database Issues:

Database OR postgres OR mongodb slow OR timeout OR connection

Cache Problems:

Redis OR memcached OR cache timeout OR eviction

Deployment Issues:

Deployment OR deploy failed OR rollback recent

Performance Degradation:

Slow OR performance OR latency [service-name]

What to Look For in Results

1. Past Incidents (Slack #incidents):

  • "Has this happened before?"
  • If yes: How was it resolved?
  • Time saved: Don't re-diagnose

2. Recent Changes (GitHub):

  • PRs merged in last 24-48 hours
  • Changes to affected service
  • Likely culprits for new bugs

3. Known Issues (GitHub Issues):

  • Open issues about this service
  • Known bugs or limitations
  • Workarounds documented

4. Runbooks (Notion):

  • Troubleshooting procedures
  • Recovery steps
  • Contact information for escalation

5. Team Knowledge (Slack #engineering):

  • Discussions about this service
  • Known gotchas or edge cases
  • Expertise (who knows this system best)

Connecting Sentry MCP for Error Logs

Why Sentry MCP?

Without Sentry MCP:

  • Search Zine → Get Slack/GitHub context
  • Open Sentry separately → Get error logs
  • Manually correlate the two

With Sentry MCP:

  • Search Zine → Get Slack/GitHub context + Sentry error logs in one query
  • AI correlates automatically

Setup Sentry MCP

Option 1: Connect to Zine (Recommended)

  1. Zine Settings → MCP Servers → Add Server
  2. Select "Sentry"
  3. Enter:
    Sentry API Key: your-sentry-api-key
    Sentry Organization: your-org-slug
    
  4. Save

Now when you query Zine, it can include Sentry data.

Option 2: Connect Directly in Cursor

Add Sentry MCP alongside Zine MCP in Cursor config:

{
  "mcpServers": {
    "zine": { ... },
    "sentry": {
      "type": "sse",
      "url": "sentry-mcp-endpoint",
      "apiKey": "your-sentry-key"
    }
  }
}
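
If you want the same data outside of Cursor (for example, in a small on-call helper script), the official MCP TypeScript SDK (@modelcontextprotocol/sdk) can talk to an SSE-based MCP server directly. The sketch below is an assumption-heavy illustration: the endpoint URL, the authentication setup, and the "search_errors" tool name are placeholders, so list the server's tools first and call whatever it actually exposes. The same pattern applies to the Zine MCP server referenced above.

// incident-context.ts: minimal sketch using the official MCP TypeScript SDK.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { SSEClientTransport } from "@modelcontextprotocol/sdk/client/sse.js";

// Placeholder endpoint; authentication headers are omitted from this sketch.
const transport = new SSEClientTransport(new URL("https://your-sentry-mcp-endpoint/sse"));
const client = new Client({ name: "incident-context", version: "0.1.0" });

await client.connect(transport);

// Discover what the server actually exposes before calling anything.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Hypothetical tool call: "search_errors" and its arguments are assumptions;
// substitute a tool name returned by listTools().
const result = await client.callTool({
  name: "search_errors",
  arguments: { query: "checkout-api timeout", statsPeriod: "1h" },
});
console.log(JSON.stringify(result.content, null, 2));

await client.close();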

Querying Sentry via Zine

Query:

Checkout API Sentry errors in the last hour

Zine returns:

  • Sentry errors (87 errors, stack traces, affected users)
  • Slack #incidents (team discussion if any)
  • GitHub recent changes (PRs merged recently)

All unified.


Post-Incident: Automated Documentation

Generate Incident Report

After resolving an incident, use Zine to generate postmortem:

Query in Zine chat:

Generate incident report for checkout API timeout on November 13, 2025

Zine AI compiles:

  1. What happened: Timeline of errors (Sentry data)
  2. Root cause: GitHub PR #567 changed Redis timeout (too aggressive)
  3. Team response: Slack #incidents thread (Bob identified issue, Alice deployed fix)
  4. Resolution: Config rolled back, errors stopped
  5. Follow-up actions: Update runbook, add monitoring for Redis timeouts

Export to Notion: Click "Export" → saves as Notion page in "Incident Reports" database.

Time saved: 30-45 minutes writing postmortem manually.


Real Incident Examples

Example 1: Database Connection Exhaustion

Alert: 4:00 AM - API returning 500 errors

Query Zine:

Database connection errors OR pool exhausted

Returns:

  • Slack #incidents (6 months ago): Same error, resolution: Increase connection pool size
  • GitHub PR #234: Past fix
  • GitHub PR #789: Merged 3 days ago, modified database config (potential cause)
  • Notion runbook: "Database Connection Issues"

Root cause identified in 3 minutes: Recent PR reduced connection pool size (optimization attempt backfired).

Fix: Revert connection pool change.

Downtime: 8 minutes (vs. 45 minutes without Zine context).


Example 2: Redis Cache Eviction Bug

Alert: 2:00 PM - Checkout flow broken

Query Zine:

Checkout OR payment redis OR cache

Returns:

  • Sentry: 143 errors, "Redis key not found"
  • Slack #engineering (last week): "Changed Redis eviction policy to save memory"
  • GitHub PR #567: "Update Redis config" (merged 3 days ago)
  • Slack #engineering (last week): Alice warned: "This might evict active session keys"

Root cause identified in 2 minutes: New eviction policy is too aggressive, evicting session keys prematurely.

Fix: Adjust eviction policy to exclude session keys.

Alice's warning was right: the Slack context meant no surprise, because the team already knew this was a risk.


Example 3: Third-Party API Outage

Alert: 5:30 PM - Payment processing failing

Query Zine:

Payment API errors OR Stripe OR payment gateway

Returns:

  • Slack #incidents: Bob posted 10 minutes ago: "Stripe status page shows outage"
  • Email (from Stripe): Incident notification received 15 minutes ago
  • Notion runbook: "Third-Party Outage Response" (enable fallback payment processor)
  • GitHub: Fallback implementation in payment-service

Root cause identified in 1 minute: Stripe outage (external, not our bug).

Response: Enable fallback processor, notify customers, monitor Stripe status.

No time wasted debugging our code (the Slack context immediately pointed to an external issue).


Best Practices

1. Connect Tools Before Incidents Happen

Don't wait until 3 AM:

  • Set up Slack, GitHub, Sentry integration during normal hours
  • Test queries during calm periods
  • Create saved views for common incident types

Preparation pays off when seconds matter.

2. Document Resolutions in Slack

After fixing:

  • Post resolution in Slack #incidents
  • Include: Root cause, fix applied, prevention steps

Why: Next time this happens (it will), Zine finds this thread immediately.

Example post:

Checkout API timeout resolved.
Root cause: PR #567 set Redis timeout to 1000ms (too low).
Fix: Increased to 3000ms.
Prevention: Added monitoring for Redis timeouts.
Runbook updated in Notion.
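
If you want resolution posts to be consistent (and therefore easy for Zine to find later), it can help to script the posting step. Below is a minimal sketch that sends the template above to a Slack incoming webhook; SLACK_WEBHOOK_URL is a placeholder for a webhook you create for #incidents, and Node 18+ is assumed for the built-in fetch.

// post-resolution.ts: send the resolution template to a Slack incoming webhook.
const webhookUrl = process.env.SLACK_WEBHOOK_URL;
if (!webhookUrl) throw new Error("Set SLACK_WEBHOOK_URL");

const resolution = [
  "Checkout API timeout resolved.",
  "Root cause: PR #567 set Redis timeout to 1000ms (too low).",
  "Fix: Increased to 3000ms.",
  "Prevention: Added monitoring for Redis timeouts.",
  "Runbook updated in Notion.",
].join("\n");

// Incoming webhooks accept a simple JSON payload with a `text` field.
const res = await fetch(webhookUrl, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: resolution }),
});
if (!res.ok) throw new Error(`Slack webhook returned ${res.status}`);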

3. Use Time Filters Strategically

Recent changes (last 24-48 hours):

after:24h deploy OR merged

Past incidents (last 6 months):

after:6mo [error-pattern]

Why: New bugs likely caused by recent changes. Historical incidents provide resolution patterns.

4. Create Incident-Specific Saved Views

"Recent Deployments":

source:github merged to:production after:48h

"Open Production Issues":

source:github label:production state:open

"Past Incidents":

source:slack channel:#incidents after:30d

Benefit: One-click access during high-stress incidents.

5. Set Up Proactive Alerts

Don't wait for pages:

  • Alert when Sentry errors spike
  • Alert when Slack #incidents has new thread
  • Alert when GitHub issues labeled "production" are created

Result: Catch issues before they become full outages.
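
To make the "Sentry errors spike" idea concrete, here is a rough sketch of what such a check involves: poll Sentry's issues API and flag a spike when too many unresolved issues have been active in the last hour. The organization and project slugs, the one-hour window, and the threshold of 50 are all assumptions; verify the endpoint and fields against Sentry's API docs before relying on it.

// sentry-spike-check.ts: crude error-spike check against Sentry's REST API.
// SENTRY_TOKEN, "your-org", and "checkout-api" are placeholders.
const token = process.env.SENTRY_TOKEN;
const url =
  "https://sentry.io/api/0/projects/your-org/checkout-api/issues/?query=is:unresolved";

const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
if (!res.ok) throw new Error(`Sentry API returned ${res.status}`);

// Each issue object includes a lastSeen timestamp; count issues active in the last hour.
const issues: Array<{ title: string; lastSeen: string }> = await res.json();
const oneHourAgo = Date.now() - 60 * 60 * 1000;
const active = issues.filter((i) => new Date(i.lastSeen).getTime() > oneHourAgo);

// The threshold of 50 is an arbitrary example; tune it to your traffic.
if (active.length > 50) {
  console.error(`Possible incident: ${active.length} issues active in the last hour`);
  process.exit(1); // a non-zero exit lets a cron/CI job page or post to Slack
} else {
  console.log(`OK: ${active.length} active issues in the last hour`);
}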


Next Steps

Now that you understand incident response with Zine:

  1. Connect Tools: Slack #incidents, GitHub, Notion runbooks
  2. Test During Calm: Practice queries before real incidents
  3. Create Saved Views: For recent changes, open issues, past incidents
  4. Set Up Alerts: Proactive monitoring
  5. Add Sentry MCP: Unified error logs + context
  6. Document Runbook: Update team's incident response process to include Zine



Every minute counts during incidents. Don't waste 30 of them gathering context.

Ready to Build with Graphlit?

Start building AI-powered applications with our API-first platform. The free tier includes 100 credits/month.

No credit card required • 5 minutes to first API call
