Senior SRE (GPU Infrastructure)

stedi talent solution

Indonesia

5-7 Years

Posted a day ago

Job Description

Company Overview:

Our client is a US-based AI platform for construction businesses that helps streamline and automate processes, increase efficiency, and improve overall project management. They are a rapidly growing company with a strong focus on innovation and providing the best solutions for their clients.

Job Summary:

We are seeking a highly skilled and experienced Senior Site Reliability Engineer (GPU Infrastructure) to design, build, and operate large-scale GPU infrastructure powering cutting-edge AI and machine learning workloads. As a Senior SRE, you will be responsible for ensuring the reliability, scalability, and performance of GPU clusters used for model training and inference, working closely with engineering and research teams.

Key Responsibilities:

Design, build, and manage large-scale GPU clusters for AI/ML workloads
Lead Kubernetes-based infrastructure operations, including cluster upgrades, scaling, and GPU scheduling
Implement and maintain Infrastructure-as-Code (IaC) using tools such as Terraform, Helm, or Ansible
Develop and optimize CI/CD pipelines and deployment workflows
Build and maintain observability systems (monitoring, logging, alerting) using tools like Prometheus, Grafana, or OpenTelemetry
Optimize system performance, including networking (e.g., InfiniBand/RDMA) and GPU utilization
Ensure high availability through incident management, on-call support, and post-mortem analysis
Collaborate with engineers and researchers to support distributed training and model-serving systems

Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or related field
5-7 years of experience in Site Reliability Engineering / DevOps in production environments
Strong experience managing Kubernetes at scale
Proficiency in Python, Bash, or similar scripting languages
Hands-on experience with Infrastructure-as-Code tools (Terraform, Helm, Ansible)
Proven track record of managing large-scale, highly available systems
Willingness to work outside normal Indonesian working hours
Indonesian citizenship

If you are a highly motivated and experienced SRE looking to join a dynamic and innovative company, we would love to hear from you! Please submit your application today.