Exercise strong technical judgment and intuition to determine when to escalate issues early (in cases of potential large-scale system impact) and when to conduct an independent initial investigation first.
Communicate clearly, effectively, and transparently with stakeholders throughout all stages of an incident (initial detection, ongoing updates, and final resolution).
Analyze logs, monitoring dashboards, and system alerts to identify root causes, where possible, before involving engineering teams.
Define and prioritize key TechOps metrics, and build operational dashboards that enable efficient emergency troubleshooting without relying solely on DevOps support.
Resolve user-specific issues independently and proactively, reducing the workload on engineering teams.
Identify recurring problem areas or processes that require manual intervention, and collaborate with Product/Engineering teams to drive permanent fixes or automation improvements.
Track and document all incidents, produce accurate monthly system availability reports, follow up on post-mortem completion, and ensure all RCA action items are delivered on time by the respective teams.
Requirements
Minimum of 10 years of experience in IT Support, Helpdesk, and/or TechOps.
Strong troubleshooting capabilities, with the ability to clearly distinguish between user-specific issues and system-related issues (backend, frontend, or infrastructure).
Proven experience in Incident Management and effective communication with cross-functional stakeholders, ranging from operational staff to senior management.
Solid understanding of basic monitoring concepts and the ability to analyze logs for initial investigation.
Ability to identify and visualize key operational data (e.g., error rates, latency, system logs) to support rapid decision-making during emergency troubleshooting scenarios.
Assertiveness in following up with technical teams and in ensuring the accuracy and timeliness of information.
Willingness to work from the office (WFO) at Mampang Blue Bird Head Office.