Incident Response Management

The SRE team at Stakater has both the responsibility and the authority to resolve incidents.

Incidents are anomalous conditions that result in — or may lead to — service degradation or outages. These events may require human intervention to avert disruptions or restore service to operational status. Incidents should always be given immediate attention.

Stakater's incident management system (IMS) is based on Google's IMS, which in turn is based on the Incident Command System.

The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

Well-defined roles, responsibilities, and workflows for members of the incident team
Control points to manage the flow of information and the resolution path
A root cause analysis where follow-up actions, lessons, and techniques are extracted and shared

Tools

Tools used to facilitate incident management at Stakater:

Alertmanager - for grouping and routing alerts fired by Prometheus
Grafana OnCall - for paging of alerts
Slack - for asynchronous communication
Google Meet - for synchronous communication

Incident Ownership

By default, the SRE on-call is the owner of the incident.

Roles and Responsibilities

Clear roles and responsibilities are important during an incident. Quick resolution requires focus and a clear hierarchy for delegating tasks. Incident response should focus on resolving the incident, not on resolving confusion about who should do what; clearly defined roles prevent confusion about accountability when an incident occurs.

The three main roles in incident response are:

Incident Commander (IC) - leads the incident response
- Commands and coordinates the incident response
- Assumes all roles that have not been delegated yet
- Communicates effectively
- Escalates alerts: Notifies the team until someone acknowledges the alert and takes on the CL role
Communications Lead (CL) - reports to the IC
- Public face of the incident response team
- Provides periodic updates to customers and the incident response team
- Manages inquiries about the incident
Operations or Ops Lead (OL) - reports to the IC
- Responds to the incident by applying operational tools to mitigate or resolve the incident
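
The IC's escalation duty listed above (notify the team until someone acknowledges the alert) can be sketched as a simple loop. This is only an illustration: `escalate`, `notify`, `has_acknowledged`, and `max_rounds` are hypothetical names, not part of Alertmanager, Grafana OnCall, or any Stakater tooling, and a real paging tool would also wait between notifications.

```python
from typing import Callable, Iterable, Optional


def escalate(
    responders: Iterable[str],
    notify: Callable[[str], None],
    has_acknowledged: Callable[[str], bool],
    max_rounds: int = 3,  # assumption: give up after a few passes over the team
) -> Optional[str]:
    """Notify responders in order until one acknowledges; return that person."""
    order = list(responders)
    for _ in range(max_rounds):
        for person in order:
            notify(person)
            if has_acknowledged(person):
                return person
    # Nobody acknowledged: the IC keeps the role that was not delegated.
    return None
```

Passing `notify` and `has_acknowledged` as callables keeps the loop independent of whichever paging channel is actually used.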

One person can hold one or more roles. What matters most is that every role is covered, because all of them are needed to deal with an incident effectively.

flowchart TD
    CL --> |Gathers incident response status|IC
    CL --> |Updates customer|Customer
    CL --> |Updates internal team|Team
    OL --> |Assists in the incident response|IC
    classDef incident fill:#f00,color:white
    IC --> |Leads the incident response|Incident:::incident
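
The role rules described above (the IC assumes every role that has not been delegated, and one person may hold several roles) can be sketched as follows. The `Role` enum and the `effective_roles` helper are hypothetical names introduced for this example only.

```python
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    IC = "Incident Commander"
    CL = "Communications Lead"
    OL = "Ops Lead"


def effective_roles(
    ic: str, delegated: Optional[Dict[Role, str]] = None
) -> Dict[Role, str]:
    """Return who currently holds each role.

    Any role missing from `delegated` falls back to the IC.
    """
    delegated = delegated or {}
    holders = {role: delegated.get(role, ic) for role in Role}
    holders[Role.IC] = ic  # the IC role itself always stays with the commander
    return holders
```

For example, `effective_roles("dana", {Role.OL: "eli"})` leaves `dana` as both IC and CL while `eli` acts as OL.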

SOP (Standard Operating Procedure) for an Incident

An incident should be declared if the answer to any of the following questions is yes:

Does the incident affect customers?
Does the incident affect the customer SLA?
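
The declaration criteria above amount to a logical OR over the two questions. A minimal sketch, where the function and parameter names are assumptions made for illustration:

```python
def should_declare_incident(affects_customers: bool, affects_customer_sla: bool) -> bool:
    """Declare an incident as soon as either question is answered with yes."""
    return affects_customers or affects_customer_sla
```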

To resolve an incident:

Make SRE on-call aware of the incident
Assign incident management roles
IC defines the incident in terms of:
- Impact
- Frequency
- Severity
IC creates an incident ticket in the Stakater ticket system
CL informs the customer and updates them on progress every hour
- Inform the customer in the external customer Slack channel
- Inform the customer via email, with their manager on CC
IC and OL begin to investigate why the incident happened
- Always reproduce issues in an incognito browser session to avoid cached content
IC and OL begin to address the incident, involving other teams where needed
IC and OL hand over ownership if their shifts end before the incident is resolved
CL creates a document to start analyzing the root cause
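
As an illustration of the steps above, the incident definition (impact, frequency, severity) and the hourly customer update can be modelled as a small record. This is a sketch under assumptions: the `Incident` class, its free-text fields, and `customer_update_due` are hypothetical and do not reflect the Stakater ticket system, and the source defines no scale for impact, frequency, or severity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

UPDATE_INTERVAL = timedelta(hours=1)  # "every hour" from the procedure above


@dataclass
class Incident:
    impact: str      # free text; no scale is defined in the procedure
    frequency: str
    severity: str
    declared_at: datetime
    last_customer_update: Optional[datetime] = None

    def customer_update_due(self, now: datetime) -> bool:
        """True if the CL owes the customer an update at time `now`."""
        if self.last_customer_update is None:
            return True  # the customer has not been informed yet
        return now - self.last_customer_update >= UPDATE_INTERVAL
```

The CL would set `last_customer_update` after each message in Slack or by email, so the check stays independent of the channel used.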

To do a post-mortem of an incident:

CL informs the customer that the incident is resolved
IC schedules a root cause analysis meeting, which everyone involved attends and where the incident document is filled out collaboratively
IC creates sub-tasks in the incident ticket for follow-up actions