stedi talent solution

Senior SRE (GPU Infrastructure)

5-7 Years
  • Posted a day ago

Job Description

Company Overview:

Our client is a US-based AI platform for construction businesses that helps streamline and automate processes, increase efficiency, and improve overall project management. They are a rapidly growing company with a strong focus on innovation and providing the best solutions for their clients.

Job Summary:

We are seeking a highly skilled and experienced Senior Site Reliability Engineer (GPU Infrastructure) to design, build, and operate large-scale GPU infrastructure powering cutting-edge AI and machine learning workloads. As a Senior SRE, you will be responsible for ensuring the reliability, scalability, and performance of GPU clusters used for model training and inference, working closely with engineering and research teams.

Key Responsibilities:

  • Design, build, and manage large-scale GPU clusters for AI/ML workloads
  • Lead Kubernetes-based infrastructure operations, including cluster upgrades, scaling, and GPU scheduling
  • Implement and maintain Infrastructure-as-Code (IaC) using tools such as Terraform, Helm, or Ansible
  • Develop and optimize CI/CD pipelines and deployment workflows
  • Build and maintain observability systems (monitoring, logging, alerting) using tools like Prometheus, Grafana, or OpenTelemetry
  • Optimize system performance, including networking (e.g., InfiniBand/RDMA) and GPU utilization
  • Ensure high availability through incident management, on-call support, and post-mortem analysis
  • Collaborate with engineers and researchers to support distributed training and model-serving systems

Qualifications:

  • Bachelor's or Master's degree in Computer Science, Engineering, or related field
  • 5-7 years of experience in Site Reliability Engineering / DevOps in production environments
  • Strong experience managing Kubernetes at scale
  • Proficiency in Python, Bash, or similar scripting languages
  • Hands-on experience with Infrastructure-as-Code tools (Terraform, Helm, Ansible)
  • Proven track record of managing large-scale, highly available systems
  • Willingness to work outside normal Indonesian working hours
  • Indonesian citizen

If you are a highly motivated and experienced SRE looking to join a dynamic and innovative company, we would love to hear from you! Please submit your application today.

Job ID: 145232949