Technology
Production incident triage
On-call gets paged and needs deploy history, error context, and the runbook in one shot, not three tabs and a Slack scroll.
We got paged for elevated 5xx on checkout. I just opened the laptop. What now?
Deep breath. Without your stack context I will stay high level. The shape is almost always the same:
- See whether error rates spiked with a deploy or config change; roll back if it lines up.
- Tighten the blast radius in logs (payments, auth, inventory, whatever owns checkout).
- Get or update an incident thread and pull in the owning on-call.
If you drop repo names, service owners, or a dashboard link, I can narrow this. Want a generic triage checklist, or help mapping who to page and when?
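The first bullet's deploy-correlation check can be sketched in a few lines. This is a minimal illustration with hypothetical timestamps and a made-up 30-minute window, not an integration with any real deploy system:

```python
from datetime import datetime, timedelta

def deploy_correlates(spike_start, deploys, window=timedelta(minutes=30)):
    """Return the most recent deploy inside `window` before the spike, else None."""
    candidates = [d for d in deploys if timedelta(0) <= spike_start - d <= window]
    return max(candidates) if candidates else None

# Hypothetical data: 5xx spike at 14:42, deploys at 13:05 and 14:20.
spike = datetime(2024, 5, 1, 14, 42)
deploys = [datetime(2024, 5, 1, 13, 5), datetime(2024, 5, 1, 14, 20)]
recent = deploy_correlates(spike, deploys)  # only the 14:20 deploy is in-window
```

If this returns a deploy, rollback is the first candidate fix; if it returns None, move on to logs and dependency status.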
Here's a concise snapshot:
- GitHub: 3 open PRs on release/acme-launch; the API rate-limit change is still in review.
- Notion: Product brief is marked "Ready for review"; decision log notes pricing approval is pending.
- Google Workspace: Next calendar block with stakeholders is tomorrow 10:00 AM; the shared prep doc was updated this morning.
I can refresh these sources before the meeting, draft the stakeholder follow-up, or break this into owners and next steps -- what should I run first?
When should I escalate this beyond me, and who should be in the loop if it drags on?
Without your written escalation policy I cannot name roles, but the pattern is: you own first triage and comms in the incident channel, then widen the loop when impact grows or your time box expires. If error rates stay bad after one pass of the obvious checks (recent deploy, config flag, dependency status), page the owning service team—not the whole company. If symptoms point at a vendor or shared platform, pull that owner or bridge in early. If your policy says "page X after N minutes with no theory," follow that over generic advice. Paste severity tiers or a link to your runbook if you want this tailored.
We got paged for elevated 5xx on checkout. I just opened the laptop. What now?
Checkout 5xx. Here is what GitHub, Slack, and the runbook line up on:
- GitHub: Last green deploy on api/checkout was 42 minutes ago; a config change PR is open and unmerged.
- Slack: #incidents thread already has a graph link; on-call acknowledged 3m ago.
- Runbook: Standard sequence: drain checkout workers, confirm payment-provider status, then scale API replicas if checkout p95 stays elevated.
I can post a tight status to #incidents, spin up the runbook checklist, or keep summarizing health checks. What do you want first?
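The runbook sequence above reduces to a small decision order. A sketch with invented step names and an invented 1000 ms p95 threshold; your runbook's real steps and budget will differ:

```python
def runbook_actions(provider_healthy, checkout_p95_ms, p95_threshold_ms=1000.0):
    """Order the runbook steps from the transcript: drain, check provider, then scale."""
    steps = ["drain checkout workers"]
    if not provider_healthy:
        # Provider looks like the culprit: open a vendor path instead of scaling.
        steps.append("open vendor escalation path")
        return steps
    if checkout_p95_ms > p95_threshold_ms:
        # Provider is fine but latency is still elevated: add capacity.
        steps.append("scale API replicas")
    return steps
```

The key property is the gate: scaling only happens after the provider check passes and only if p95 is still over budget.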
When should I escalate this beyond me, and who should be in the loop if it drags on?
Stay on #incidents for status; someone already ack'd there. Escalate wider only after you have run the runbook header: drain checkout workers, verify payment-provider status, then scale API replicas if checkout p95 is still bad. With a deploy on api/checkout ~42m ago plus an open config PR, if those steps do not improve 5xx within your policy window, page the checkout service owner and decide together on rollback or the config change—not before you have payment-provider signal. If the provider is red, open a vendor path before piling on app engineers. Want me to post that escalation framing in Slack?
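The escalation timing in that answer can be written down as a tiny policy function. Purely illustrative, with invented argument names standing in for whatever your real policy defines:

```python
def escalation_target(minutes_elapsed, policy_window_min, provider_red, have_theory):
    """Who to pull in next, per the pattern above (not a real policy)."""
    if provider_red:
        # Vendor signal trumps internal paging: open the vendor path first.
        return "open vendor path"
    if minutes_elapsed >= policy_window_min and not have_theory:
        # Past the policy window with no working theory: page the owner.
        return "page checkout service owner"
    return "stay in #incidents"
```

For example, 45 minutes in with no theory against a 30-minute window means paging the checkout service owner; inside the window, you stay in #incidents.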