The bug exists only in production. You can't reproduce it locally. Now what?
Step 1: Don't Panic
Production bugs feel urgent. That urgency leads to rushed changes that make things worse.
Take a breath. Open a doc. Start writing down what you know.
Step 2: Gather Evidence
Before changing anything, collect data:
- Error messages: the exact text, not a paraphrase.
- Stack traces: full traces, including framework code.
- Logs: the five minutes before and after the error.
- Request data: what input triggered this?
- Environment: which server? What version? What config?
- Timing: when did it start? Does it correlate with a deployment?
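If your logs are plain files, a few lines of Python can slice out that window. A sketch, assuming each line starts with an ISO-8601 timestamp; adjust the parsing to your actual log format:

```python
from datetime import datetime, timedelta

def log_window(lines, error_time, minutes=5):
    """Return log lines within +/- `minutes` of the error.

    Assumes each line starts with an ISO timestamp like
    '2024-01-15T10:32:07 ...' -- adapt to your format.
    """
    lo = error_time - timedelta(minutes=minutes)
    hi = error_time + timedelta(minutes=minutes)
    window = []
    for line in lines:
        try:
            ts = datetime.fromisoformat(line.split(" ", 1)[0])
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if lo <= ts <= hi:
            window.append(line)
    return window
```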
Step 3: Form Hypotheses
Based on the evidence, list possible causes:
- Recent deployment introduced a bug
- Database connection pool exhausted
- Third-party API changed behavior
- Memory leak under high load
- Race condition in concurrent requests
Rank by likelihood. Test the most likely first.
Step 4: Narrow the Scope
Check if it's deployment-related:

```shell
# When did this start?
git log --since="2 hours ago" --oneline

# What changed?
git diff HEAD~5..HEAD
```

If the error started exactly when you deployed, the new code is suspect.
Check if it's data-related: Does it happen for all users or specific ones? Specific data often reveals data-dependent bugs.
Check if it's load-related: Does it happen under high traffic? Check your metrics for correlation.
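A quick way to answer the "all users or specific ones?" question, assuming you can export error events as records with a user id (the field names here are illustrative):

```python
from collections import Counter

def failure_breakdown(error_records):
    """Count errors per user to see whether the bug is user-specific.

    `error_records` is a list of dicts like {"user_id": ...} --
    a shape assumed for illustration, not any particular log format.
    """
    counts = Counter(rec["user_id"] for rec in error_records)
    total = sum(counts.values())
    # A handful of users producing most of the errors suggests
    # a data-dependent bug; an even spread suggests something systemic.
    return counts.most_common(5), total
```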
Step 5: Add Instrumentation
If logs aren't enough, add more:
```python
logger.info(
    "Processing request",
    extra={
        "user_id": user.id,
        "request_size": len(request.body),
        "endpoint": request.path,
    },
)
```

Structured logging is searchable; grepping through plain text is painful.
Trace the request path:
```python
import uuid

request_id = str(uuid.uuid4())
logger.info(f"[{request_id}] Starting request")
# ... later ...
logger.info(f"[{request_id}] DB query returned {len(results)} rows")
# ... later ...
logger.info(f"[{request_id}] Response sent, status={status}")
```

Now you can follow one request through the entire system.
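Threading the id through every call by hand gets old. One alternative, sketched here, is a contextvar plus a logging filter, so every log line in the request picks up the id automatically (wiring a %(request_id)s field into your formatter is left to your logging config):

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's id; set once at the top of each request.
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("app")
logger.addFilter(RequestIdFilter())

def handle_request():
    # Hypothetical handler: generate the id once; every log call
    # in this context picks it up via the filter.
    request_id_var.set(str(uuid.uuid4()))
    logger.info("Starting request")
```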
Step 6: Reproduce in Staging
Can you make staging match production?
- Same data (or a subset)
- Same config
- Same load patterns
If you can reproduce in staging, you can debug safely.
Step 7: The Scientific Method
- Hypothesis: "The error happens when user has > 1000 items"
- Prediction: "If I create a test user with 1001 items, it will fail"
- Test: Create the user, trigger the endpoint
- Observe: Did it fail? How?
- Refine: If wrong, form new hypothesis
Don't guess and deploy. Test your hypothesis first.
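The loop above can be sketched as a disposable test. Everything here is a hypothetical stand-in, process_items included; swap in a call against your real endpoint:

```python
def process_items(items):
    """Stand-in for the production code path under suspicion
    (hypothetical -- replace with a request to your real endpoint)."""
    if len(items) > 1000:
        raise ValueError("too many items")  # the suspected failure mode
    return len(items)

def test_hypothesis():
    """Hypothesis: the error happens when a user has > 1000 items."""
    assert process_items(list(range(1000))) == 1000  # control: should pass
    try:
        process_items(list(range(1001)))             # prediction: should fail
    except ValueError:
        return "hypothesis confirmed"
    return "hypothesis rejected -- form a new one"
```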
Common Production Gotchas
Environment differences:
- Different Python version
- Missing environment variable
- Different timezone settings
- Different file permissions
Data differences:
- Edge cases that don't exist in test data
- Unicode characters
- Very long strings
- Null values where you expect data
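It's cheap to keep a list of these inputs around and throw them at any handler you suspect. A sketch, where the function under test is whatever parses or processes user input in your code:

```python
# Edge-case inputs worth feeding into staging tests; production data
# contains all of these even when your fixtures don't.
EDGE_CASES = [
    "",                        # empty string
    None,                      # null where data is expected
    "a" * 100_000,             # very long string
    "naïve café 日本語",        # unicode beyond ASCII
    " leading and trailing ",  # stray whitespace
]

def check_handles_edge_cases(fn):
    """Run `fn` (hypothetical: your input handler) over each edge case,
    collecting (input, exception) pairs instead of dying on the first."""
    failures = []
    for case in EDGE_CASES:
        try:
            fn(case)
        except Exception as exc:
            failures.append((case, exc))
    return failures
```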
Load differences:
- Connection pool exhaustion
- Memory pressure
- Race conditions
- Timeout differences
External dependencies:
- API rate limits
- Certificate expiration
- DNS issues
- Network partitions
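Many of these failures are transient. A minimal retry-with-backoff sketch (not a substitute for proper timeouts and circuit breakers) keeps brief blips from becoming incidents, while persistent failures still surface:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.5):
    """Retry a flaky external call with exponential backoff.

    A sketch: real code should also cap total elapsed time, catch only
    the exceptions that are actually retryable, and log each failure.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts -- let the error surface
            time.sleep(base_delay * (2 ** attempt))
```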
When You're Stuck
Rubber duck it. Explain the problem out loud. Often you'll realize what you missed.
Take a break. Fresh eyes catch things tired eyes miss.
Ask for help. Someone else might see the pattern you're blind to.
Binary search the history. If it worked before, git bisect finds when it broke.
After the Fix
Write a postmortem. What happened? Why? How did we fix it? How do we prevent it?
Add monitoring. If this happened once, it can happen again. Alert on it.
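As a sketch of the idea, here's a tiny in-process error-rate check; in practice you'd alert from your metrics stack rather than application code:

```python
import time
from collections import deque

class ErrorRateAlert:
    """Fire when more than `threshold` errors occur within `window` seconds.

    Illustrative only -- a real setup would use your monitoring system's
    native alert rules instead of in-process bookkeeping.
    """
    def __init__(self, threshold=5, window=60.0):
        self.threshold = threshold
        self.window = window
        self.times = deque()

    def record_error(self, now=None):
        now = time.monotonic() if now is None else now
        self.times.append(now)
        # Drop errors that have aged out of the window.
        while self.times and now - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) > self.threshold  # True => alert
```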
Improve logging. What information would have helped? Add it for next time.
Update runbooks. Document the debugging process for future incidents.
The goal isn't just fixing this bug. It's making the next bug easier to find.