The bug exists only in production. You can't reproduce it locally. Now what?
Step 1: Don't Panic
Production bugs feel urgent. That urgency leads to rushed changes that make things worse.
Take a breath. Open a doc. Start writing down what you know.
Step 2: Gather Evidence
Before changing anything, collect data:
- Error messages: the exact text, not a paraphrase.
- Stack traces: full traces, including framework code.
- Logs: the five minutes before and after the error.
- Request data: what input triggered this?
- Environment: which server? What version? What config?
- Timing: when did it start? Does it correlate with a deployment?
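If your logs are plain files, a few lines of Python can slice out that window. A sketch, assuming each line starts with an ISO-8601 timestamp; adjust the parsing to your actual log format:

```python
from datetime import datetime, timedelta

def log_window(lines, error_time, minutes=5):
    """Return log lines within +/- `minutes` of the error.

    Assumes each line starts with an ISO timestamp like
    '2024-01-15T10:32:07 ...' -- adapt to your format.
    """
    lo = error_time - timedelta(minutes=minutes)
    hi = error_time + timedelta(minutes=minutes)
    window = []
    for line in lines:
        try:
            ts = datetime.fromisoformat(line.split(" ", 1)[0])
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if lo <= ts <= hi:
            window.append(line)
    return window
```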
Step 3: Form Hypotheses
Based on the evidence, list possible causes:
- Recent deployment introduced a bug
- Database connection pool exhausted
- Third-party API changed behavior
- Memory leak under high load
- Race condition in concurrent requests
Rank by likelihood. Test the most likely first.
Step 4: Narrow the Scope
Check if it's deployment-related:

```shell
# When did this start?
git log --since="2 hours ago" --oneline

# What changed?
git diff HEAD~5..HEAD
```

If the error started exactly when you deployed, the new code is suspect.
Check if it's data-related: Does it happen for all users or specific ones? Specific data often reveals data-dependent bugs.
Check if it's load-related: Does it happen under high traffic? Check your metrics for correlation.
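A quick way to answer the "all users or specific ones?" question, assuming you can export error events as records with a user id (the field names here are illustrative):

```python
from collections import Counter

def failure_breakdown(error_records):
    """Count errors per user to see whether the bug is user-specific.

    `error_records` is a list of dicts like {"user_id": ...} --
    a shape assumed for illustration, not any particular log format.
    """
    counts = Counter(rec["user_id"] for rec in error_records)
    total = sum(counts.values())
    # A handful of users producing most of the errors suggests
    # a data-dependent bug; an even spread suggests something systemic.
    return counts.most_common(5), total
```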
Step 5: Add Instrumentation
If logs aren't enough, add more:
```python
logger.info(
    "Processing request",
    extra={
        "user_id": user.id,
        "request_size": len(request.body),
        "endpoint": request.path,
    },
)
```

Structured logging is searchable; grepping through plain text is painful.
Trace the request path:
```python
import uuid

request_id = str(uuid.uuid4())
logger.info(f"[{request_id}] Starting request")
# ... later ...
logger.info(f"[{request_id}] DB query returned {len(results)} rows")
# ... later ...
logger.info(f"[{request_id}] Response sent, status={status}")
```

Now you can follow one request through the entire system.
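Threading the id through every call by hand gets old. One alternative, sketched here, is a contextvar plus a logging filter, so every log line in the request picks up the id automatically (wiring a %(request_id)s field into your formatter is left to your logging config):

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the current request's id; set once at the top of each request.
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("app")
logger.addFilter(RequestIdFilter())

def handle_request():
    # Hypothetical handler: generate the id once; every log call
    # in this context picks it up via the filter.
    request_id_var.set(str(uuid.uuid4()))
    logger.info("Starting request")
```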
Step 6: Reproduce in Staging
Can you make staging match production?
- Same data (or a subset)
- Same config
- Same load patterns
If you can reproduce in staging, you can debug safely.
Step 7: The Scientific Method
- Hypothesis: "The error happens when user has > 1000 items"
- Prediction: "If I create a test user with 1001 items, it will fail"
- Test: Create the user, trigger the endpoint
- Observe: Did it fail? How?
- Refine: If wrong, form new hypothesis
Don't guess and deploy. Test your hypothesis first.
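The loop above can be sketched as a disposable test. Everything here is a hypothetical stand-in, process_items included; swap in a call against your real endpoint:

```python
def process_items(items):
    """Stand-in for the production code path under suspicion
    (hypothetical -- replace with a request to your real endpoint)."""
    if len(items) > 1000:
        raise ValueError("too many items")  # the suspected failure mode
    return len(items)

def test_hypothesis():
    """Hypothesis: the error happens when a user has > 1000 items."""
    assert process_items(list(range(1000))) == 1000  # control: should pass
    try:
        process_items(list(range(1001)))             # prediction: should fail
    except ValueError:
        return "hypothesis confirmed"
    return "hypothesis rejected -- form a new one"
```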
Common Production Gotchas
Environment differences:
- Different Python version
- Missing environment variable
- Different timezone settings
- Different file permissions
Data differences:
- Edge cases that don't exist in test data
- Unicode characters
- Very long strings
- Null values where you expect data
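It's cheap to keep a list of these inputs around and throw them at any handler you suspect. A sketch, where the function under test is whatever parses or processes user input in your code:

```python
# Edge-case inputs worth feeding into staging tests; production data
# contains all of these even when your fixtures don't.
EDGE_CASES = [
    "",                        # empty string
    None,                      # null where data is expected
    "a" * 100_000,             # very long string
    "naïve café 日本語",        # unicode beyond ASCII
    " leading and trailing ",  # stray whitespace
]

def check_handles_edge_cases(fn):
    """Run `fn` (hypothetical: your input handler) over each edge case,
    collecting (input, exception) pairs instead of dying on the first."""
    failures = []
    for case in EDGE_CASES:
        try:
            fn(case)
        except Exception as exc:
            failures.append((case, exc))
    return failures
```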
Load differences:
- Connection pool exhaustion
- Memory pressure
- Race conditions
- Timeout differences
External dependencies:
- API rate limits
- Certificate expiration
- DNS issues
- Network partitions
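Many of these failures are transient. A minimal retry-with-backoff sketch (not a substitute for proper timeouts and circuit breakers) keeps brief blips from becoming incidents, while persistent failures still surface:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.5):
    """Retry a flaky external call with exponential backoff.

    A sketch: real code should also cap total elapsed time, catch only
    the exceptions that are actually retryable, and log each failure.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts -- let the error surface
            time.sleep(base_delay * (2 ** attempt))
```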
When You're Stuck
Rubber duck it. Explain the problem out loud. Often you'll realize what you missed.
Take a break. Fresh eyes catch things tired eyes miss.
Ask for help. Someone else might see the pattern you're blind to.
Binary search the history. If it worked before, git bisect finds when it broke.
After the Fix
Write a postmortem. What happened? Why? How did we fix it? How do we prevent it?
Add monitoring. If this happened once, it can happen again. Alert on it.
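As a sketch of the idea, here's a tiny in-process error-rate check; in practice you'd alert from your metrics stack rather than application code:

```python
import time
from collections import deque

class ErrorRateAlert:
    """Fire when more than `threshold` errors occur within `window` seconds.

    Illustrative only -- a real setup would use your monitoring system's
    native alert rules instead of in-process bookkeeping.
    """
    def __init__(self, threshold=5, window=60.0):
        self.threshold = threshold
        self.window = window
        self.times = deque()

    def record_error(self, now=None):
        now = time.monotonic() if now is None else now
        self.times.append(now)
        # Drop errors that have aged out of the window.
        while self.times and now - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) > self.threshold  # True => alert
```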
Improve logging. What information would have helped? Add it for next time.
Update runbooks. Document the debugging process for future incidents.
The goal isn't just fixing this bug. It's making the next bug easier to find.