Problem Management
Eliminate recurring incidents and reduce risk without turning root cause analysis into bureaucracy.
Problem Management in ITSMote exists for one reason - to stop the same incidents from happening again.
If it does not reduce incident frequency, blast radius, or recovery time over time, it is waste.
Problem Management is not a reporting ritual and not a postmortem factory.
Definition
A problem is the underlying cause or systemic weakness that leads to one or more incidents, or creates a credible risk of future incidents.
A problem may exist:
- after incidents have already happened
- without any incident yet (known risk)
Not a problem (usually)
- One-off incidents with no realistic chance of recurrence
- Cosmetic bugs with no operational or user impact
- Vague technical debt with no link to incidents or risk
If it does not affect reliability, recovery, or risk - it does not belong here.
Goal
- Prevent repeat incidents
- Reduce blast radius when incidents happen
- Make incident response cheaper and faster over time
Problem Management is long-term reliability work, not real-time response.
When to open a problem
Open a problem if any of the following is true:
- The same or a similar incident has happened more than once
- Incident impact was high (SEV1, SEV2)
- Root cause points to a systemic weakness
Rule:
If you say "this will happen again" - you need a problem ticket.
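The three triggers above can be sketched as a small check. This is a minimal illustration, not an ITSMote feature: the `Incident` shape, the `root_cause_key` label used to group similar incidents, and the `systemic` flag are all hypothetical, and "high impact" is read as SEV1 or SEV2 per the list above.

```python
from dataclasses import dataclass

# Hypothetical incident shape; field names are illustrative only.
@dataclass
class Incident:
    id: str
    severity: str            # e.g. "SEV1", "SEV2", "SEV3"
    root_cause_key: str      # assumed label grouping same or similar incidents
    systemic: bool = False   # True if the root cause points to a systemic weakness

HIGH_IMPACT = {"SEV1", "SEV2"}

def should_open_problem(incident: Incident, history: list[Incident]) -> list[str]:
    """Return the reasons a problem ticket is warranted (empty list means none)."""
    reasons = []
    # Trigger 1: the same or a similar incident has happened before.
    if any(
        past.root_cause_key == incident.root_cause_key and past.id != incident.id
        for past in history
    ):
        reasons.append("recurrence")
    # Trigger 2: incident impact was high.
    if incident.severity in HIGH_IMPACT:
        reasons.append("high impact")
    # Trigger 3: root cause points to a systemic weakness.
    if incident.systemic:
        reasons.append("systemic weakness")
    return reasons
```

Any non-empty result means "open a problem". The judgment call "this will happen again" still belongs to a person; the check only catches the obvious cases.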
Relationship to incidents
- Incidents restore service
- Problems prevent recurrence
Do not block incident closure waiting for a problem to be fixed.
An incident can be Resolved/Closed while the related problem remains Open.
Link problems to all relevant incidents.
Roles
Problem Owner (mandatory, single person)
Accountable for driving the problem to a real outcome:
- drives root cause analysis to completion (ensures it is done and documented)
- defines prevention strategy
- creates and tracks follow-ups
- decides when the problem can be closed
The owner does not have to do every task personally, but is accountable for the outcome and for assigning owners to follow-ups.
Never assign “the team” as Problem Owner. Name one person.
Contributors
Engineers or stakeholders providing analysis or implementing fixes.
Lifecycle (ITSMote)
Problem Management is intentionally simple and slow-paced compared to incidents.
1) Identify
Trigger sources:
- recurring incidents
- SEV1 / high-impact SEV2
- operational risk review
- incident follow-ups / PIR outputs
Do:
- Create problem ticket
- Link related incidents
- Assign Problem Owner
Status:
- Open
2) Analyze (find the real cause)
Goal: understand why this keeps happening or could happen.
Rules:
- Facts over stories
- Systems over individuals
- “Human error” is never a root cause by itself
Typical analysis angles:
- missing guardrails
- unsafe defaults
- weak detection
- unclear ownership
- overloaded components
- fragile dependencies
- manual or undocumented procedures
Record:
- Clear root cause statement
- Evidence (logs, metrics, timelines, configs)
Status:
- Analyzing
3) Control (reduce risk now, if needed)
Optional but important.
Use if:
- The fix will take time
- Risk is high
- Another incident is likely before full resolution
Examples:
- add monitoring or alerts
- add runbook
- add temporary limits or safeguards
- document known failure modes
Status:
- Controlled
4) Fix (eliminate or reduce the cause)
Goal: implement changes that actually reduce future incidents.
Rules:
- Fewer, higher-quality actions beat long lists
- Each action must change the system, not just describe it
Examples:
- automation instead of manual steps
- validation instead of tribal knowledge
- architectural simplification
- safer deployment patterns
- better defaults
All actions must have:
- single owner
- measurable outcome
- due date
Status:
- In Progress
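The "single owner, measurable outcome, due date" rule is easy to check mechanically. A minimal sketch, assuming a hypothetical `Action` record; the list of placeholder owner names it rejects is an illustrative guess, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical follow-up action record; field names are assumptions.
@dataclass
class Action:
    title: str
    owner: Optional[str] = None
    measurable_outcome: Optional[str] = None
    due: Optional[date] = None

# Placeholder values that do not count as a single named owner (illustrative).
NOT_AN_OWNER = {"the team", "team", "tbd"}

def action_gaps(action: Action) -> list[str]:
    """Return which of the three mandatory attributes are missing."""
    gaps = []
    owner = (action.owner or "").strip()
    # A comma suggests several names, which is not a single owner.
    if not owner or owner.lower() in NOT_AN_OWNER or "," in owner:
        gaps.append("single owner")
    if not (action.measurable_outcome or "").strip():
        gaps.append("measurable outcome")
    if action.due is None:
        gaps.append("due date")
    return gaps
```

An action with any gaps should not be accepted into the Fix step. Whether the stated outcome is actually measurable still needs a human review.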
5) Verify
Goal: confirm the problem is truly addressed.
Verify by:
- absence of repeat incidents over time
- improved metrics (errors, latency, recovery)
- successful simulations or tests
Do not close based on hope.
Status:
- Verifying
6) Close
Close only when:
- Root cause is documented
- Preventive actions are completed
- Risk is demonstrably reduced or eliminated
- Related incidents are linked
Final status:
- Closed
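The six statuses can be read as a small state machine. The sketch below is one possible interpretation: the status names come from the steps above, but the transition table is an assumption. In particular, skipping Controlled (because Control is optional) and going back from Verifying to In Progress after a failed verification are inferred, not stated.

```python
from enum import Enum

class Status(Enum):
    OPEN = "Open"
    ANALYZING = "Analyzing"
    CONTROLLED = "Controlled"
    IN_PROGRESS = "In Progress"
    VERIFYING = "Verifying"
    CLOSED = "Closed"

# Assumed transition table; adjust to match your tracker's workflow.
ALLOWED = {
    Status.OPEN: {Status.ANALYZING},
    Status.ANALYZING: {Status.CONTROLLED, Status.IN_PROGRESS},  # Control is optional
    Status.CONTROLLED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.VERIFYING},
    Status.VERIFYING: {Status.CLOSED, Status.IN_PROGRESS},      # failed verification: back to Fix
    Status.CLOSED: set(),
}

def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move skips a lifecycle step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

The point of the table is that nothing reaches Closed without passing through Verifying, which matches "Do not close based on hope."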
Known Error (optional)
Use a Known Error record if:
- Root cause is understood
- Fix is deferred or risky
- Workaround exists
Known Error must include:
- clear symptoms
- impact scope
- safe workaround
- conditions to escalate
- owner
- review date (or expiry date)
If nobody uses the workaround, the record is useless.
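A Known Error record with the mandatory fields above might look like the following sketch. The class and field names are hypothetical. The review check treats the review date as a deadline, which is one reasonable reading of "review date (or expiry date)".

```python
from dataclasses import dataclass, fields
from datetime import date

# Hypothetical Known Error record; one field per mandatory item above.
@dataclass
class KnownError:
    symptoms: str
    impact_scope: str
    workaround: str
    escalate_when: str
    owner: str
    review_date: date

    def missing_fields(self) -> list[str]:
        """Names of mandatory fields that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def needs_review(self, today: date) -> bool:
        """True once the review (or expiry) date has been reached."""
        return today >= self.review_date
```

A scheduled job could list every record where `needs_review` is true, so stale workarounds are revisited instead of silently rotting.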
Metrics (minimum viable)
If Problem Management does not change these, it is not working.
Track quarterly:
- Repeat incident rate (same root cause)
- % of problems with completed follow-ups
- Time from problem open to verified fix
- SEV1 incidents caused by known problems
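The first metric needs a precise definition before it can be tracked. One possible definition, assumed here rather than taken from the text: an incident is a repeat if an earlier incident in the same window had the same root cause, and the rate is repeats divided by all incidents.

```python
from collections import Counter

def repeat_incident_rate(root_causes: list[str]) -> float:
    """Share of incidents in the window that repeat an already-seen root cause.

    `root_causes` holds one root-cause label per incident in the window.
    The first incident for each cause is not a repeat; every later one is.
    """
    if not root_causes:
        return 0.0
    counts = Counter(root_causes)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(root_causes)

# Five incidents in a quarter: "cert" occurred three times, so two are repeats.
quarter = ["cert", "cert", "disk", "cert", "dns"]
print(repeat_incident_rate(quarter))  # 0.4
```

Whatever definition is chosen, keep it fixed across quarters; otherwise the trend is meaningless.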
Documentation standard
Keep it short and actionable.
Minimum problem record fields:
- Problem summary
- Related incidents
- Root cause (clear and specific)
- Risk description
- Preventive actions (with owners and dates)
- Verification method
Rule:
If a new engineer can’t understand the problem in 5 minutes, the doc is bad.
Minimal template
Problem header (copy/paste)
- Title:
- Problem Owner:
- Linked incidents:
- Root cause:
- Risk:
- Current status:
- Preventive actions:
- Verification plan:
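If problem records live in a tool or script rather than a wiki, the header can be generated and checked for gaps. A minimal sketch: the field names are copied from the template above, while the dict-based record and both helper functions are assumptions for illustration.

```python
# Field names copied from the template; the dict shape is an assumption.
HEADER_FIELDS = [
    "Title", "Problem Owner", "Linked incidents", "Root cause",
    "Risk", "Current status", "Preventive actions", "Verification plan",
]

def render_header(record: dict[str, str]) -> str:
    """Render the record in the template's layout, one field per line."""
    lines = []
    for name in HEADER_FIELDS:
        value = record.get(name, "").strip()
        lines.append(f"- {name}: {value}".rstrip())
    return "\n".join(lines)

def empty_fields(record: dict[str, str]) -> list[str]:
    """Template fields that are still blank in this record."""
    return [name for name in HEADER_FIELDS if not record.get(name, "").strip()]
```

A record with blank fields is fine at the Identify step; by Close, `empty_fields` should return an empty list.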