Incident & Problem Manager
Req ID: 7174
Job Description:
Primary Functions:
The Incident and Problem Manager is responsible for the end-to-end management of the Incident and Problem Management processes to ensure minimal disruption to business operations and to improve service quality. The role is crucial in a highly regulated banking environment to support operational resilience, continuous improvement, and compliance with internal and external governance frameworks. This role requires strong leadership, analytical skills, and the ability to coordinate effectively across IT teams to manage incident resolution and drive root cause analysis for permanent fixes.
Duties and Responsibilities
Incident Management
Own the end-to-end Incident Management process, ensuring consistent handling across all IT functions.
Lead and coordinate responses to high-impact incidents, including mobilization of resolution teams and communication with stakeholders.
Drive incident triage, categorization, prioritization, and timely escalation according to defined SLAs.
Lead Major Incident Management (MIM) calls and ensure structured updates to senior stakeholders and impacted business units.
Analyze incident trends to drive service improvement and reduce recurrence.
Ensure accurate and timely documentation of all incidents, including incident timelines, actions taken, communications, impact assessments, and resolution steps.
Maintain an incident log for audit and reporting purposes, aligned with internal governance and regulatory expectations.
Problem Management
Own and maintain the Problem Management process and documentation in alignment with ITIL best practices.
Identify root causes for recurring and significant incidents using structured methodologies such as 5 Whys, Fishbone (Ishikawa), or Kepner-Tregoe.
Organize and lead cross-functional technical review meetings to investigate problems and drive toward permanent solutions.
Maintain and manage the Known Error Database (KEDB) and validate temporary workarounds or fixes.
Collaborate with Change Management to ensure corrective actions are implemented with minimal risk.
Drive trend analysis using incident data to proactively identify risks and areas for improvement.
Develop and maintain comprehensive documentation for problems, including problem records, root cause analysis reports, known error records, and workaround procedures.
Ensure Problem Management documentation supports audit, compliance, and knowledge sharing objectives.
Process Integration & Governance
Ensure effective integration with key ITSM processes such as:
o Change & Release Management
o Configuration Management (CMDB)
o Business Continuity & Disaster Recovery (BCP/DR)
o Service Level Management
o Capacity and Availability Management
Drive post-incident reviews (PIRs) and lessons-learned sessions to ensure knowledge is retained and action plans are executed.
Provide periodic reports and dashboards on incident and problem trends, root causes, and improvement initiatives to stakeholders and auditors.
Contribute to audit readiness and compliance with regulatory requirements and industry frameworks such as BNM RMiT, ISO 27001, and ITIL.