Handle medium- to high-complexity incidents related to data center infrastructure, including network equipment, servers, and storage systems.
Perform root cause analysis where possible, troubleshoot, and implement technical solutions to restore services and maintain 99.9% SLA compliance.
Execute hardware diagnostics, component replacements, and system recovery procedures for GPU servers, storage arrays, and networking equipment.
Act as the escalation point for L1 Operations Engineers, providing technical guidance and mentoring.
Collaborate with L3 engineers, vendors, and cross-functional teams on complex problem resolution and infrastructure projects.
Develop and improve SOPs, technical documentation, runbooks, and best practices for data center operations.
Conduct post-incident reviews and implement preventive measures to reduce recurring issues.
Job Requirements:
Minimum Diploma/Bachelor's degree in IT, Computer Science, or Electrical Engineering
2-4 years of experience in data center operations, infrastructure engineering, or server support
Hands-on experience with enterprise server hardware (GPU servers, storage systems, networking)
Strong Linux skills: troubleshooting, performance tuning, service and user management
Solid understanding of data center networking (VLANs, routing, switching, firewalls, load balancers)
Experience with hardware diagnostics tools, IPMI/BMC, and remote management systems
Knowledge of ITIL framework, incident management, and SLA-driven operations
Able to work independently and act as a technical reference for the L1 team
Physical ability to perform hands-on work in a data center environment
Preferred:
Hands-on experience with cloud platforms (AWS, GCP, Azure, or Alibaba Cloud)
Experience with NVIDIA GPU infrastructure or AI/ML server platforms
Familiarity with liquid cooling systems and thermal management
Experience with automation tools and scripting (Bash, Python, Ansible)
Knowledge of storage protocols (NFS, iSCSI, Fibre Channel)
Certifications in data center operations (DCCA, CDCP) are a plus
Specific to the AI Infrastructure Vacancy:
Understanding of high-speed interconnects (NVLink, InfiniBand, PCIe)
NVIDIA certification or equivalent vendor certifications (Dell, HP, Cisco)