Shopee

[AI] AIGC Distributed Training & Optimization Engineer (Pre-training)

2-4 Years

Job Description

About Us

Sea Group is establishing a new, strategic AI department dedicated to exploring how generative AI can transform human connection, self-expression, diverse communication, and social interaction. We are building the next generation of AI-native applications and a comprehensive Model-as-a-Service (MaaS) product support system. Drawing on massive multi-country data, we are building a leading multilingual AI ecosystem from the ground up. We welcome outstanding talent to join us in building leading Southeast Asian multilingual models and exploring innovative AI-native applications.

The AIGC team in the Sea AI Department is dedicated to pushing the boundaries of visual synthesis. We aim to achieve industry leadership in high-fidelity portrait and video generation. The team focuses on fundamental research and on scaling generative models to empower next-generation social and e-commerce platforms.

About the Job

  • Toolchain Development: Design and build distributed training toolchains to support ultra-large-scale AIGC model training.
  • System Optimization: Optimize distributed training performance across computation, communication, and storage layers.
  • Stability & Scalability: Analyze and resolve technical bottlenecks in the training process, specifically focusing on improving training stability and efficiency.
  • Frontier Research: Track and explore cutting-edge distributed training technologies, and lead project planning and production-grade implementation.

Requirements:

  • Master's degree or above in Computer Science or a related field; a Bachelor's degree can be considered with strong industry experience.
  • Minimum 2 years of relevant experience.
  • Distributed Expertise: Deep understanding of distributed training principles (Data/Pipeline/Tensor/Expert Parallelism) with proven hands-on experience.
  • Framework Proficiency: Expert in deep learning frameworks such as PyTorch, DeepSpeed, and Megatron-LM.
  • Low-level Knowledge: Familiarity with GPU hardware architecture and CUDA programming, experience in CUDA kernel development and debugging, and familiarity with NCCL and cuDNN.
  • AIGC Background: Understanding of AIGC pre-training methodologies, Transformer architectures, and Diffusion models (e.g., Stable Diffusion, Flux).
  • Core Competency: Strong problem-solving skills, innovative thinking, and excellent team collaboration/communication skills.

About Company

Shopee Pte. Ltd. is a Singaporean multinational technology company that specialises in e-commerce. The company was launched in Singapore in 2015, before it expanded abroad. As of 2021, Shopee is considered the largest e-commerce platform in Southeast Asia with 343 million monthly visitors.

Job ID: 146653367