How We Handle Urgent Bugs in Production (Our Incident Process)

We are a team of four senior engineers. We do not have a dedicated SRE team, a 24/7 on-call rotation, or an incident management platform. What we do have is a production incident process that has kept us honest across twelve shipped products — including MindHyv, Trackelio, VincelIO, and several client projects handling real users and real money.

This post documents our process. It is not enterprise incident management — it is what actually works for a small team that ships fast and maintains multiple products simultaneously.

Detection: Knowing Something Is Wrong

You cannot fix what you do not know about. The first step in any incident process is detection, and at our scale, we rely on three layers:

Automated monitoring. Every production application has basic uptime monitoring through Better Uptime (now Better Stack). We get Slack notifications within 60 seconds of a downtime event. For Supabase-backed projects, we also monitor database connection pool utilization and edge function error rates through the Supabase dashboard.

Error tracking. We use Sentry on every project. Sentry captures unhandled exceptions, groups them by stack trace, and alerts us on new error types or spikes in existing errors. The key configuration is alerting on error frequency, not just occurrence — a single 500 error is noise, but 50 of the same error in ten minutes is a pattern.
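The frequency-over-occurrence idea is configured in Sentry's alert rules rather than in application code, but the logic behind it can be sketched as a sliding-window counter. This is a minimal illustration only: SpikeDetector and its parameters are hypothetical names, not part of the Sentry SDK.

```typescript
// Hypothetical sketch: count occurrences of one error fingerprint inside a
// sliding time window and report when the count reaches a threshold.
class SpikeDetector {
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly seen = new Map<string, number[]>();

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold; // e.g. 50 occurrences
    this.windowMs = windowMs; // e.g. ten minutes
  }

  // Record one occurrence at nowMs. Returns true when the number of
  // occurrences still inside the window has reached the threshold.
  record(fingerprint: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.seen.get(fingerprint) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.seen.set(fingerprint, recent);
    return recent.length >= this.threshold;
  }
}
```

With a threshold of 50 and a ten-minute window, a single 500 error never fires, while the fiftieth occurrence of the same fingerprint inside the window does.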

Here is our standard Sentry setup for a SvelteKit project:

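// src/hooks.server.ts
import * as Sentry from '@sentry/sveltekit';
// By default import.meta.env only exposes VITE_-prefixed variables, so the DSN
// is read through SvelteKit's private env module instead.
import { SENTRY_DSN } from '$env/static/private';

Sentry.init({
  dsn: SENTRY_DSN,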
  environment: import.meta.env.MODE,
  tracesSampleRate: 0.1, // 10% of transactions for performance monitoring
  beforeSend(event) {
    // Strip sensitive data
    if (event.request?.headers) {
      delete event.request.headers['authorization'];
      delete event.request.headers['cookie'];
    }
    return event;
  },
});

export const handleError = Sentry.handleErrorWithSentry();

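// src/hooks.client.ts
import * as Sentry from '@sentry/sveltekit';
// PUBLIC_-prefixed variables are safe to ship to the browser and are read
// through SvelteKit's public env module.
import { PUBLIC_SENTRY_DSN } from '$env/static/public';

Sentry.init({
  dsn: PUBLIC_SENTRY_DSN,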
  environment: import.meta.env.MODE,
  replaysSessionSampleRate: 0,
  replaysOnErrorSampleRate: 1.0, // Capture session replay on every error
  integrations: [Sentry.replayIntegration()],
});

export const handleError = Sentry.handleErrorWithSentry();

The replaysOnErrorSampleRate: 1.0 setting is critical. It captures a session replay for every client-side error, which means we can see exactly what the user did before the error occurred. This has cut our average debugging time in half.

Client reports. Sometimes the monitoring misses things. A feature works but produces wrong data. A flow is broken for users on a specific browser. We get these reports through support channels — email, in-app feedback widgets, or direct messages from clients. We treat every client bug report as a potential incident until proven otherwise.

Severity Classification

Not every bug is an incident, and not every incident requires the same response. We use three severity levels:

SEV-1: Service is down or data integrity is at risk. The application is unreachable, users cannot log in, payments are failing, or data is being corrupted. Drop everything. All hands on deck. We aim to acknowledge within 15 minutes and have a fix deployed within 2 hours.

SEV-2: A major feature is broken for all users. The application is up, but a core feature does not work. Users can still access the product, but they cannot complete a key workflow. One engineer takes point. We aim to fix within the same business day.

SEV-3: A bug affects some users or a non-critical feature. An edge case in a form, a visual glitch on a specific screen size, a slow query that affects performance but not functionality. This goes into the normal sprint backlog and gets prioritized against other work.

The distinction between SEV-1 and SEV-2 matters because SEV-1 interrupts everyone. We have learned that pulling four engineers off their current work for a SEV-2 bug makes things worse, not better. Too many people on a non-critical incident creates confusion and slows down the fix.
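The three levels can be written down as a small decision function, which makes the SEV-1 versus SEV-2 boundary explicit. This is a sketch under assumptions: the IncidentFacts fields and the RESPONSE table are hypothetical names that restate the definitions above, not tooling we run.

```typescript
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3';

// Hypothetical shape: the yes/no facts gathered before classifying.
interface IncidentFacts {
  serviceDown: boolean;
  loginBroken: boolean;
  paymentsFailing: boolean;
  dataIntegrityAtRisk: boolean;
  coreFeatureBrokenForAllUsers: boolean;
}

function classifySeverity(f: IncidentFacts): Severity {
  // Anything that takes the service down or threatens data is SEV-1.
  if (f.serviceDown || f.loginBroken || f.paymentsFailing || f.dataIntegrityAtRisk) {
    return 'SEV-1';
  }
  // The product is up, but a key workflow is broken for everyone.
  if (f.coreFeatureBrokenForAllUsers) return 'SEV-2';
  // Everything else goes to the normal backlog.
  return 'SEV-3';
}

// Who responds and how fast, restating the targets from the text.
const RESPONSE: Record<Severity, { responders: string; target: string }> = {
  'SEV-1': { responders: 'all engineers', target: 'acknowledge within 15 minutes, fix deployed within 2 hours' },
  'SEV-2': { responders: 'one engineer on point', target: 'fix within the same business day' },
  'SEV-3': { responders: 'normal sprint backlog', target: 'prioritized against other work' },
};
```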

Triage: The First 15 Minutes

When an alert fires or a critical bug report comes in, the first engineer to see it owns triage. Here is the checklist:

Verify the issue is real. Check the monitoring dashboard, try to reproduce in production, look at Sentry for related errors. False alarms happen.
Classify severity. Is this a SEV-1, SEV-2, or SEV-3? This determines who needs to be involved and how fast we need to move.
Post in the incident channel. We have a dedicated Slack channel for incidents. The triage engineer posts a short summary: what is broken, who is affected, what severity level, and what they know so far.
Notify the client. For client projects, the triage engineer sends a brief message to the client acknowledging the issue and setting expectations for resolution time. More on this below.
Assign ownership. For SEV-1, everyone drops what they are doing. For SEV-2, one engineer takes point and pulls in help if needed.

The worst thing you can do during triage is start fixing the problem before you understand it. We have a rule: no code changes in the first ten minutes unless the fix is immediately obvious (reverting a deployment, toggling a feature flag). Those ten minutes are for gathering information.
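The incident-channel summary from the checklist has four parts: what is broken, who is affected, the severity level, and what is known so far. A minimal sketch of a formatter that forces all four to be filled in follows. The IncidentSummary type and formatIncidentSummary function are hypothetical helpers, shown only to illustrate the shape of the message.

```typescript
// Hypothetical shape of the triage summary posted to the incident channel.
interface IncidentSummary {
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3';
  whatIsBroken: string;
  whoIsAffected: string;
  knownSoFar: string;
}

// Build a short plain-text message. The result is just a string, so it can
// be pasted by hand or sent through whatever Slack integration is in place.
function formatIncidentSummary(s: IncidentSummary): string {
  return [
    `[${s.severity}] ${s.whatIsBroken}`,
    `Affected: ${s.whoIsAffected}`,
    `Known so far: ${s.knownSoFar}`,
  ].join('\n');
}

const message = formatIncidentSummary({
  severity: 'SEV-2',
  whatIsBroken: 'Invoice totals are wrong on export',
  whoIsAffected: 'All users exporting invoices',
  knownSoFar: 'Started after the latest deployment; Sentry shows no new errors',
});
```

Making every field required means the type checker rejects a summary that leaves out who is affected or what is known.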

Client Communication

This is the part most technical teams get wrong. Clients do not care about your stack traces. They care about three things: what is broken, when it will be fixed, and whether their data is safe.

Our communication template for SEV-1 and SEV-2 incidents:

Initial notification (within 15 minutes of detection):

We’ve identified an issue affecting [specific feature/functionality]. [Brief description of impact — what users are experiencing]. Our team is actively investigating. We’ll provide an update within [30 minutes / 1 hour].

Update messages (every 30-60 minutes for SEV-1, every 2-4 hours for SEV-2):

Update on [issue]: We’ve identified the root cause as [brief, non-technical explanation]. We’re working on a fix and expect to deploy it by [time]. [Any workaround if available].

Resolution message:

The issue with [feature] has been resolved. [Brief explanation of what happened and what we did to fix it]. We’ll be monitoring closely over the next 24 hours. We’ll share a full incident report within [2 business days].

Notice the language: no jargon, no blame, specific timelines, and proactive updates. The client should never have to ask us for a status update — if they do, we have already failed at communication.
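The update cadence above can be turned into a simple overdue check, so nobody has to remember when the next client message is due. This sketch assumes the upper bound of each range (60 minutes for SEV-1, 4 hours for SEV-2) as the hard limit; the function and constant names are hypothetical.

```typescript
type ClientFacingSeverity = 'SEV-1' | 'SEV-2';

// Longest allowed gap between client updates, in minutes (assumed to be the
// upper end of the ranges in the template).
const MAX_UPDATE_GAP_MINUTES: Record<ClientFacingSeverity, number> = {
  'SEV-1': 60, // updates every 30-60 minutes
  'SEV-2': 240, // updates every 2-4 hours
};

// Latest time by which the next client update should go out.
function nextUpdateDue(severity: ClientFacingSeverity, lastUpdate: Date): Date {
  return new Date(lastUpdate.getTime() + MAX_UPDATE_GAP_MINUTES[severity] * 60_000);
}

// True when the client has been left without an update for too long.
function isUpdateOverdue(severity: ClientFacingSeverity, lastUpdate: Date, now: Date): boolean {
  return now.getTime() > nextUpdateDue(severity, lastUpdate).getTime();
}
```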

We discussed our broader approach to client relationships in our post on what happens when you email a dev studio.

The Hotfix Workflow

Once we understand the problem, here is how we deploy a fix:

# 1. Create a hotfix branch from production
git checkout main
git pull origin main
git checkout -b hotfix/invoice-calculation-error

# 2. Make the fix
# ... code changes ...

# 3. Write a test that reproduces the bug
# ... test code ...

# 4. Verify the fix locally
npm run test
npm run build

# 5. Commit, push, and create a PR
git add -A
git commit -m "fix: correct invoice tax calculation rounding"
git push -u origin hotfix/invoice-calculation-error
gh pr create --title "fix: correct invoice tax calculation rounding" \
  --body "Fixes SEV-1 incident - tax calculation was truncating instead of rounding"

# 6. Get a review (even for hotfixes)
# At least one other engineer reviews the PR

# 7. Merge and deploy
gh pr merge --squash

A few important practices here:

We always create a branch, even for hotfixes. No direct pushes to main. The few minutes it takes to create a branch and PR are worth the safety net of a review.

We always write a test. The test reproduces the bug before the fix and passes after. This is non-negotiable, even under time pressure. A hotfix without a test is a future regression waiting to happen. We wrote about our testing philosophy in our post on shipping fast without breaking things.

We always get a review. For SEV-1 incidents, the review can be a quick Slack message (“look at this diff, does this look right?”), but another pair of eyes catches mistakes that are easy to miss under pressure.

We deploy to a staging environment first when possible. For SEV-1 where the service is actively down, we may skip staging and go straight to production. For SEV-2, staging first, always.
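To make the "test reproduces the bug" rule concrete, here is a sketch of a regression test for the rounding bug named in the PR description above. The two tax functions are hypothetical stand-ins, not the real invoice code, and they assume amounts in integer cents and rates in basis points.

```typescript
// Hypothetical buggy version: truncates the fractional cent.
function calculateTaxCentsBuggy(subtotalCents: number, rateBps: number): number {
  return Math.trunc((subtotalCents * rateBps) / 10_000);
}

// Hypothetical fixed version: rounds to the nearest cent.
function calculateTaxCents(subtotalCents: number, rateBps: number): number {
  return Math.round((subtotalCents * rateBps) / 10_000);
}

// Regression test: 19.99 at 8.25% is 1.649175, which must round to 1.65.
// Taking the calculator as a parameter lets the same test run against both
// versions: it fails on the buggy one and passes on the fixed one.
function testTaxRoundsToNearestCent(calc: (subtotalCents: number, rateBps: number) => number): void {
  const actual = calc(1999, 825);
  if (actual !== 165) {
    throw new Error(`expected 165 cents of tax, got ${actual}`);
  }
}

testTaxRoundsToNearestCent(calculateTaxCents);
```

Running the same test against the truncating version throws, which is the evidence the reviewer needs that the test actually covers the incident.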

Feature Flags as an Emergency Brake

We use feature flags for any high-risk deployment. When something goes wrong, toggling a flag off is faster than reverting a deployment.

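// Simple feature flag check. The query syntax matches Drizzle ORM
// (eq is imported from 'drizzle-orm'); db and featureFlags come from the
// project's own database setup and schema.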
const flags = await db
  .select()
  .from(featureFlags)
  .where(eq(featureFlags.key, 'new_invoice_engine'));

const useNewEngine = flags[0]?.enabled ?? false;

if (useNewEngine) {
  return calculateInvoiceV2(lineItems, taxRate);
} else {
  return calculateInvoiceV1(lineItems, taxRate);
}

We do not use a dedicated feature flag service — for our scale, a database table with key-value pairs and a simple admin toggle is sufficient. The critical thing is that the flag check exists before risky code ships.

This has saved us twice on MindHyv. Both times, a new feature worked perfectly in staging but caused issues in production due to data edge cases we did not anticipate. Toggling the flag off took 30 seconds. Rolling back the deployment would have taken 10 minutes.
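An emergency brake is only useful if it fails toward the known-good path. The sketch below wraps a flag lookup so that a missing row or a failed query both resolve to "off". The FlagLookup type and isFlagEnabled function are hypothetical; the lookup is passed in so the example does not depend on any particular database client.

```typescript
// Hypothetical lookup: resolves to the stored value, or undefined when the
// flag has no row. In a real project this would wrap the database query.
type FlagLookup = (key: string) => Promise<boolean | undefined>;

async function isFlagEnabled(key: string, lookup: FlagLookup): Promise<boolean> {
  try {
    // Missing flag: treat as off so the old code path runs.
    return (await lookup(key)) ?? false;
  } catch {
    // Lookup failed (for example, the database is unreachable): stay on the old path.
    return false;
  }
}
```

The design choice is that the new code path has to be turned on explicitly; any doubt about the flag's state keeps users on the version that was already running in production.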

The Post-Mortem

Every SEV-1 and SEV-2 incident gets a post-mortem within two business days. Not as punishment — as learning. Here is our template:

## Incident Report: [Title]

**Date:** [Date]
**Duration:** [Time from detection to resolution]
**Severity:** [SEV-1 / SEV-2]
**Impact:** [What users experienced]

### Timeline

- [HH:MM] Alert triggered / issue reported
- [HH:MM] Engineer acknowledged, began triage
- [HH:MM] Root cause identified
- [HH:MM] Fix deployed to staging
- [HH:MM] Fix deployed to production
- [HH:MM] Confirmed resolved, monitoring

### Root Cause

[Clear, specific explanation of what went wrong technically]

### Contributing Factors

[What conditions allowed this to happen? Missing test? Insufficient monitoring?
Data edge case? Deployment without staging?]

### Resolution

[What we did to fix it]

### Action Items

- [ ] [Specific action to prevent recurrence] — Owner: [Name] — Due: [Date]
- [ ] [Additional monitoring or alerting] — Owner: [Name] — Due: [Date]
- [ ] [Test coverage improvement] — Owner: [Name] — Due: [Date]

The most important section is Action Items. A post-mortem without action items is just storytelling. Every post-mortem should produce at least one concrete change — a new test, a new alert, a process change, a code improvement — that makes this specific category of failure less likely in the future.
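Two parts of the template are easy to check mechanically: the duration field and whether every action item has an owner and a due date. The following is a sketch under assumptions; the ActionItem shape and both function names are hypothetical, and due dates are assumed to be ISO date strings.

```typescript
// Hypothetical shape of one entry in the Action Items section.
interface ActionItem {
  description: string;
  owner: string;
  due: string; // ISO date, e.g. "2024-03-15"
}

// Duration as defined in the template: time from detection to resolution.
function incidentDurationMinutes(detectedAt: Date, resolvedAt: Date): number {
  return Math.round((resolvedAt.getTime() - detectedAt.getTime()) / 60_000);
}

// Returns a list of problems; an empty list means the action items are usable.
function actionItemProblems(items: ActionItem[]): string[] {
  const problems: string[] = [];
  if (items.length === 0) problems.push('post-mortem has no action items');
  items.forEach((item, i) => {
    if (item.owner.trim() === '') problems.push(`action item ${i + 1} has no owner`);
    if (Number.isNaN(Date.parse(item.due))) problems.push(`action item ${i + 1} has no valid due date`);
  });
  return problems;
}
```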

We share the post-mortem with clients for SEV-1 incidents. It builds trust. When a client sees that you not only fixed the problem but analyzed it systematically and took steps to prevent recurrence, they know they are working with a team that takes reliability seriously.

Monitoring Checklist for New Projects

When we start a new project, this is the monitoring we set up during the first week:

Uptime monitoring — Better Uptime checks the health endpoint every 30 seconds
Error tracking — Sentry on both server and client with session replay enabled
Database monitoring — Supabase dashboard for connection pool and query performance
Deployment notifications — Vercel or Cloudflare deployment status posts to Slack
Log aggregation — Structured logging with the project name and environment in every log entry

// Structured logging utility
function log(level: 'info' | 'warn' | 'error', message: string, meta?: Record<string, unknown>) {
  const entry = {
    ...meta, // spread first so caller metadata cannot overwrite the fixed fields below
    timestamp: new Date().toISOString(),
    level,
    project: 'mindhyv',
    environment: process.env.NODE_ENV,
    message,
  };
  console[level](JSON.stringify(entry));
}

// Usage
log('error', 'Invoice calculation failed', {
  invoiceId: invoice.id,
  userId: user.id,
  error: err.message,
});

Structured JSON logs might seem like overkill for a small team, but they pay for themselves the first time you need to search logs for a specific user’s request during an incident.
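Searching for one user's requests is where the JSON format pays off, because each line can be parsed and filtered by field instead of matched with a regular expression. A minimal sketch follows, assuming the log lines have already been pulled from wherever the platform stores them; findEntries is a hypothetical helper.

```typescript
// One parsed log line. Extra metadata fields (userId, invoiceId, ...) are
// allowed through the index signature.
interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
  [key: string]: unknown;
}

// Return every entry whose given field strictly equals the given value.
function findEntries(rawLines: string[], field: string, value: unknown): LogEntry[] {
  const matches: LogEntry[] = [];
  for (const line of rawLines) {
    try {
      const entry = JSON.parse(line) as LogEntry;
      if (entry !== null && typeof entry === 'object' && entry[field] === value) {
        matches.push(entry);
      }
    } catch {
      // Skip lines that are not JSON, such as framework banners or stack traces.
    }
  }
  return matches;
}

const sampleLines = [
  '{"timestamp":"2024-01-01T10:00:00.000Z","level":"info","message":"Invoice created","userId":"u_1"}',
  'Listening on port 3000',
  '{"timestamp":"2024-01-01T10:00:05.000Z","level":"error","message":"Invoice calculation failed","userId":"u_2"}',
];
const forUser = findEntries(sampleLines, 'userId', 'u_2');
```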

What We Have Learned

After handling production incidents across a dozen products for five years, here is what we know:

Speed of detection matters more than speed of resolution. An incident that takes 30 minutes to detect and 30 minutes to fix has a 60-minute user impact. An incident that takes 5 minutes to detect and 45 minutes to fix has a 50-minute user impact, but more importantly, you are communicating with the client during the fix instead of after.

Communication is the difference between a crisis and a hiccup. Clients can handle downtime. They cannot handle silence. A proactive message that says “we know, we are working on it” transforms the client’s experience from “is anyone even watching this?” to “they are on it.”

Post-mortems are the compound interest of reliability. Each one makes the next incident less likely, less severe, or shorter. Over five years and dozens of post-mortems, our mean time to resolution has dropped significantly.

Keep the process lightweight. We tried PagerDuty, Statuspage, and formal incident commander roles. For a team of four, all of that is overhead. A Slack channel, a severity system, and a post-mortem template cover what we need.

If you are building a product and want a team that takes production reliability seriously from day one, reach out at hello@threshline.com.