Incident Management

Restore service fast, minimize impact, and learn without bureaucracy - the ITSMote way.

Incident Management in ITSMote is built to reduce uncertainty and restore service fast. Every part of the process exists only if it measurably improves reliability, recovery, or predictability.

Definition

An incident is any unplanned interruption to a service, or a reduction in service quality, that impacts users or creates a credible risk of impact.

Not an incident (usually)

  • A known, low-risk maintenance action with an approved change plan
  • A long-running product limitation with no current degradation (track as a Problem or backlog item)

If you're debating - treat it as an incident until proven otherwise.

Goal

Restore normal service operation quickly and safely, with minimal business impact.

Scope

This process applies to:

  • Customer-facing outages or degraded performance
  • Data integrity risks
  • Security-impacting service disruptions (coordinate with security flow if you have one)
  • Major internal platform outages affecting delivery/support

Roles

Incident Lead (mandatory, single person)

Accountable for the incident end-to-end:

  • declares severity
  • drives coordination
  • ensures comms happen
  • decides when to close
  • ensures post-incident actions are created

Never assign "the team" as owner.

Responders

Engineers/support who investigate and implement mitigations.

Comms owner (optional)

Owns status updates and stakeholder comms so the Incident Lead can run the room.

Severity model

Severity exists for one reason - to align people fast on urgency, coordination, and communication.
It is impact-first: based on real user and business impact, not on how scary the graphs look, and not on perception or noise.

Rules

  • Pick the least severe SEV that still matches the real user and business impact - don't inflate.
  • You can change SEV as you learn more.
  • Impact beats service count. One service can be SEV1 if it blocks the core user journey.
  • Data integrity and security risk override everything - treat as SEV1 until proven otherwise.

What is a "critical service"?

A critical service is anything that, if broken, prevents the main user journey or creates immediate contractual/SLA risk. Examples:

  • auth/login
  • core API
  • payments/checkout
  • trading/execution
  • account access and key workflows

"Critical" is defined by your product and existing contracts.

SEV1 - Critical

Use SEV1 if any of these are true:

  • Core user journey is down or unusable for a large share of users
  • Confirmed or high risk of data loss, data corruption, security breach, or incorrect financial operations
  • Major contractual/SLA breach is likely, or a predefined priority customer is blocked in a high-stakes way

Typical signals:

  • No workable mitigation, or workaround does not scale
  • Situation can spread fast

Required response:

  • Incident channel + bridge/call
  • Frequent updates (every 15-30 minutes)
  • Clear decision log (what changed, why, results)

SEV2 - Major

Use SEV2 if:

  • Core functionality is severely degraded or unavailable for a meaningful subset of users
  • System still delivers value in other parts, but business impact is material
  • Workaround may exist, but is manual, partial, unreliable, or significantly reduces value

Required response:

  • Incident owner + active coordination
  • Regular updates (every 30-60 minutes)
  • Mitigation plan documented and executed

SEV3 - Moderate

Use SEV3 if:

  • Only non-critical functionality is impacted, or the impact is limited to a small segment
  • Workaround exists and is acceptable for most affected users
  • Business impact is contained

Required response:

  • Owner + ticket
  • Updates as changes occur (no fixed cadence required)
  • Fix plan and verification criteria

SEV4 - Minor

Use SEV4 if:

  • Cosmetic issue, minor bug, isolated cases, low business impact
  • No risk to data integrity, security, or key contracts
  • Workaround is easy or impact is negligible

Required response:

  • Ticket + prioritization (backlog/sprint)

Fast SEV decision flow (for on-call)

  1. Any risk or evidence of data loss, corruption, security breach, or incorrect financial operations? → SEV1
  2. Is the core user journey down or unusable for a large share of users? → SEV1
  3. Is core functionality severely degraded or down for a meaningful subset, but the product is still partially usable? → SEV2
  4. Is impact limited to non-critical functionality or a small segment, with an acceptable workaround? → SEV3
  5. Everything else → SEV4
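
For teams with on-call tooling, the same flow can be encoded directly. A minimal sketch in Python, assuming the questions above are answered as booleans during triage (the function and field names are illustrative, not part of ITSMote or any specific tool):

from dataclasses import dataclass

@dataclass
class TriageAnswers:
    # Answers to the decision flow above, gathered during triage (illustrative fields).
    data_or_security_risk: bool        # data loss/corruption, breach, incorrect financial ops
    core_journey_down_for_many: bool   # core user journey down/unusable for a large share
    core_severely_degraded: bool       # core degraded or down for a meaningful subset
    limited_with_workaround: bool      # non-critical scope or small segment, workaround OK

def pick_sev(a: TriageAnswers) -> str:
    # Mirrors the flow above, top to bottom: the first matching rule wins.
    if a.data_or_security_risk or a.core_journey_down_for_many:
        return "SEV1"
    if a.core_severely_degraded:
        return "SEV2"
    if a.limited_with_workaround:
        return "SEV3"
    return "SEV4"

# Example: checkout fails for a meaningful subset, the rest of the product still works.
print(pick_sev(TriageAnswers(False, False, True, False)))  # -> SEV2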

Incident lifecycle (ITSMote)

This lifecycle is designed to restore service fast, keep communication predictable, and avoid process theater.

Communication is continuous - it starts at Open and continues until Closed.
SEV starts as a preliminary SEV (fast, signal-based) and is confirmed or corrected during triage.
Mitigation and resolution are different - mitigation is the first measurable reduction of impact, resolution is the end of incident response.


0) Detect

Signals:
- monitoring alerts
- user reports
- internal observation

Automation rule:
- Every critical alert should auto-create an incident ticket stub (title, service, alert link, timestamp).

Output:
- A clear signal that something might be wrong (with links to evidence).
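
A minimal sketch of the automation rule, assuming an alert webhook payload and a hypothetical create_ticket() call into your tracker (both are placeholders, not a specific product API):

from datetime import datetime, timezone

def create_ticket(fields: dict) -> str:
    # Placeholder: wire this to your tracker (Jira, ServiceNow, etc.). Returns a ticket id.
    raise NotImplementedError

def on_critical_alert(alert: dict) -> str:
    # Turn a critical alert into an incident ticket stub: title, service, alert link, timestamp.
    stub = {
        "title": f"[Incident] {alert.get('service', 'unknown')}: {alert.get('summary', 'critical alert')}",
        "service": alert.get("service", "unknown"),
        "alert_link": alert.get("url", ""),
        "opened_at": datetime.now(timezone.utc).isoformat(),
        "status": "Open",
    }
    return create_ticket(stub)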


1) Open (create ticket + start triage)

Goal: enter response mode fast and qualify the signal.

Do immediately:
- Create/open the incident ticket if it wasn’t created automatically.
- Link evidence (dashboards, logs, alerts, traces).
- Start an incident channel/bridge if the suspected severity warrants it.
- Post the first status update: Investigating - impact unknown - next update at <time>.

Record:
- Open time (start of tracking)

Exit criteria:
- Ticket exists with evidence links.
- Someone is actively triaging.

Status:
- Open


2) Confirm or Cancel

Goal: stop guessing - decide whether this is a real incident.

Checklist:
- Is there real user impact or credible risk (data integrity/security/financial correctness)?
- What is broken? What is the user journey impact?
- Blast radius: who/what/where (region, segment, service, key customers)?
- When did it start? Any recent changes (deploy/config/infra/dependency)?
- What is the safest fast mitigation path (rollback/flag/failover/rate-limit)?
- Who must be pulled in now?

Decision (must pick one):
- Confirmed - impact proven or risk is real
- Cancelled - false alarm / duplicate / test / no evidence and no risk

If Confirmed, do:
- Assign Incident Lead (single owner driving the incident).
- Set the Confirmed SEV (or keep the preliminary SEV if it is already right).
- Write 3-6 lines in the ticket:
  - Current understanding (facts only)
  - Confirmed SEV (and why)
  - Next actions

Record:
- Time to confirm (Open -> Confirmed) or time to cancel (Open -> Cancelled)

Statuses:
- Confirmed
- Cancelled


3) Restore (fix the incident, mitigation-first)

Goal: reduce user impact fast, then get to stable recovery.

Rule:
- After Confirmed, you are in response mode immediately.
- First priority is mitigation if it can reduce impact faster than a full fix.
- Root cause can wait until impact is reduced.

Mitigation options (examples):
- rollback
- disable feature / flip flag
- failover / reroute
- rate limit / shed load
- restart with guardrails (only if safe)
- temporary capacity increase
- isolate bad node / dependency
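
As an illustration of the "disable feature / flip flag" option above, a sketch built around a hypothetical in-memory FlagStore (not a specific feature-flag product); the point is to flip first, then record the action so the decision log stays usable:

class FlagStore:
    # Hypothetical flag backend; in practice this wraps whatever your platform uses.
    def __init__(self):
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str, default: bool = True) -> bool:
        return self._flags.get(name, default)

def kill_switch(flags: FlagStore, feature: str, log_note) -> None:
    # Mitigation-first: disable the feature, then append the action to the decision log.
    flags.set(feature, False)
    log_note(f"Mitigation: disabled feature flag '{feature}' to reduce user impact.")

# Usage: kill_switch(FlagStore(), "new-checkout-flow", print)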

Rules:
- If it reduces impact quickly, take it - even if it’s ugly. Clean up later.
- Downgrade SEV only after impact is materially reduced and proven by signals.
- Do not downgrade based on optimism. Use metrics.

Exit criteria:
- Impact is reduced or eliminated, proven by evidence.

Transition:
- Move to Monitoring once service is restored (even if the long-term fix is not done yet).


4) Monitoring (service restored, verify stability)

Goal: confirm the service is stable before declaring victory.

Definition:
- Monitoring = service restored and verifying stability, not “we are still investigating”.

Do:
- Watch primary SLO signals (errors, latency, availability).
- Watch secondary symptoms (queues, retries, saturation, backlog).
- Monitor for a defined window (depends on SEV and service criticality).
- If relapse happens: return to Restore and re-evaluate SEV.

Status:
- Monitoring
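
One way to make "proven by signals" concrete during Monitoring - a sketch that requires the error rate to stay under a threshold for every sample in the monitoring window; fetch_error_rates() is a placeholder for your metrics backend:

from typing import Callable, Sequence

def stable_for_window(
    fetch_error_rates: Callable[[], Sequence[float]],  # e.g. per-minute error rate over the window
    threshold: float = 0.01,                           # tune per service SLO (1% here)
) -> bool:
    # Stable only if the window is non-empty and every sample stays under the threshold.
    rates = fetch_error_rates()
    return len(rates) > 0 and all(r < threshold for r in rates)

# Example: stable_for_window(lambda: [0.002, 0.001, 0.0]) -> True, OK to move on.
# Any spike above the threshold means going back to Restore and re-evaluating SEV.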


5) Resolve (finish incident response)

Goal: complete incident response as a process and ensure follow-ups exist.

Do:
- Document mitigation and final state.
- Validate recovery with clear checks (metrics, synthetic tests, user confirmation where needed).
- Create follow-ups for long-term fixes and prevention (Problem/Change/Engineering tasks).
- Ensure each follow-up has a single owner and a due date.

Record:
- Recovered time - the moment service is back to normal and proven by measurement (the endpoint for TTR).

Status:
- Resolved


6) Close (hard final)

Goal: lock the incident only when it is complete and usable for learning.

Close only when:
- User impact is gone and recovery is proven by signals.
- Ticket has a usable timeline (key events with timestamps).
- Mitigation and outcome are documented.
- Follow-ups exist (or explicitly None with a reason), each with owner and due date.
- Key timestamps captured (Open, Confirmed/Cancelled, Mitigated, Recovered).

Final status:
- Closed

Notes:
- Cancelled incidents should also move to Closed once the cancellation reason and minimal evidence are recorded.
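
The close gate can be enforced mechanically. A sketch of the checklist as a function - field names are illustrative and should map onto whatever your tracker actually stores:

def can_close(incident: dict) -> tuple[bool, list[str]]:
    # Returns (ok, blockers). An incident closes only with zero blockers.
    blockers = []
    if not incident.get("recovery_proven"):
        blockers.append("user impact / recovery not proven by signals")
    if not incident.get("timeline"):
        blockers.append("timeline with timestamps missing")
    if not incident.get("mitigation_and_outcome"):
        blockers.append("mitigation and outcome not documented")
    followups = incident.get("followups")
    if followups is None and not incident.get("no_followups_reason"):
        blockers.append("follow-ups missing (or no explicit 'None' reason)")
    for f in followups or []:
        if not f.get("owner") or not f.get("due_date"):
            blockers.append(f"follow-up '{f.get('title', '?')}' lacks owner or due date")
    for ts in ("opened_at", "confirmed_or_cancelled_at", "mitigated_at", "recovered_at"):
        if not incident.get(ts):
            blockers.append(f"missing timestamp: {ts}")
    return (len(blockers) == 0, blockers)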

Metrics (minimum viable)

If you don't measure these, you can't claim the process works.

Track per incident:

  • TTC - time to confirm or cancel
  • TTM - time to mitigate (impact materially reduced)
  • TTR - time to recover (service back to normal, proven by signals)
  • TTClose - time to close (ticket Closed: timeline + follow-ups complete)

Track monthly:

  • Number of SEV1/SEV2 incidents
  • % incidents with resolution within target
  • % incidents with completed follow-ups on time
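
A sketch of how the per-incident timers fall out of the timestamps captured during the lifecycle (assuming timezone-aware datetimes; the field names are illustrative):

from datetime import datetime
from typing import Optional

def minutes_between(start: Optional[datetime], end: Optional[datetime]) -> Optional[float]:
    # None stays None so missing timestamps show up as gaps instead of fake zeros.
    if start is None or end is None:
        return None
    return (end - start).total_seconds() / 60

def incident_timers(t: dict) -> dict:
    # t maps names to datetimes: opened_at, confirmed_at or cancelled_at,
    # mitigated_at, recovered_at, closed_at.
    opened = t.get("opened_at")
    return {
        "TTC_min": minutes_between(opened, t.get("confirmed_at") or t.get("cancelled_at")),
        "TTM_min": minutes_between(opened, t.get("mitigated_at")),
        "TTR_min": minutes_between(opened, t.get("recovered_at")),
        "TTClose_min": minutes_between(opened, t.get("closed_at")),
    }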

Documentation standard

Document what you actually did, not what you wish you did.

Minimum incident record fields:

  • Severity
  • Affected services
  • Detect time (if known)
  • Open time
  • Confirmed / Cancelled time
  • Mitigated time
  • Recovered time
  • Closed time
  • Impact description (user-facing)
  • Resolution details
  • Prevention plan
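
The same minimum fields as a record structure - a sketch only; map the names onto your tracker's fields rather than treating them as canonical:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class IncidentRecord:
    severity: str                          # SEV1-SEV4
    affected_services: list[str]
    impact_description: str                # user-facing, plain language
    opened_at: datetime
    detected_at: Optional[datetime] = None
    confirmed_or_cancelled_at: Optional[datetime] = None
    mitigated_at: Optional[datetime] = None
    recovered_at: Optional[datetime] = None
    closed_at: Optional[datetime] = None
    resolution_details: str = ""
    prevention_plan: str = ""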

Post-Incident Review (PIR)

Run PIR for:

  • all SEV1
  • SEV2 with messy response or repeat risk
  • anything that teaches you a system weakness

PIR outputs (keep it tight):

  • Root cause (the real one, not "human error")
  • Why detection failed or lagged (if it did)
  • Why mitigation took time (if it did)
  • 3-7 concrete follow-ups max:
      • one owner each
      • measurable outcome
      • due date
      • priority

Rule: if follow-ups are not owned, they don't exist.

If PIR identifies a systemic cause or credible repeat risk, open/link a Problem ticket and track preventive actions there.

Minimal templates

Incident header (copy/paste)

  • Title:
  • Severity:
  • Incident Lead:
  • Start time (UTC):
  • Customer impact:
  • Current status:
  • Next update time:
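
A hypothetical filled-in header, purely for illustration (every value below is invented):

  • Title: Checkout errors spiking for EU users
  • Severity: SEV2
  • Incident Lead: <name>
  • Start time (UTC): 2025-03-04 09:42
  • Customer impact: ~20% of EU checkout attempts failing with payment errors
  • Current status: Mitigating - payment provider failover in progress
  • Next update time: 10:30 UTC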