Job Description:
- Design and develop the Smart platform and other AI systems, covering all essential agent platform components - agent runtimes, toolchains, memory, orchestration, logging, planning, and sandboxing.
- Collaborate with teams across departments - both tech and non-tech - to land and scale real-world AI use cases.
- Drive internal agent adoption in engineering infra by replacing traditional operations with intelligent agent workflows.
- Work on platform reliability and observability to ensure our systems are performant, debuggable, and production-ready.
- Balance operations and development work - you'll gain experience both shipping features and running real services.
- Manage and operate supporting infrastructure, including vector stores, retrieval systems (RAG), and middleware components.
- Continuously experiment, learn, and bring in the latest advancements from the AI/agent ecosystem into production.
- Continuously optimize system performance through tuning, profiling, and root-cause analysis.
- Design and develop an automated technical operations platform to reduce manual work and improve reliability.
- Drive capacity and resource management, ensuring systems scale efficiently under varying workloads.
- Plan and execute stress tests and load tests to identify bottlenecks, improve throughput, and eliminate redundancy.
- Improve system reliability, availability, and observability through better monitoring, logging, alerting, and incident response practices.
- Troubleshoot complex production issues across application, middleware, and infrastructure layers.
Requirements:
- Bachelor's degree or higher in Computer Science, Engineering, or a related field.
- 2+ years of relevant experience in software development - or a fresh graduate with strong fundamentals and a hunger to learn.
- Strong programming skills in Python (primary) - familiarity with Golang is a plus.
- Solid CS fundamentals - algorithms, data structures, networking, systems, and architecture.
- Backend experience designing cloud-ready services using databases, queues, caching, etc.
- Familiarity with modern AI agent frameworks such as LangChain or LangGraph, or a strong interest in learning them.
- Strong problem-solving mindset and the ability to navigate ambiguity with curiosity and creativity.
- Willingness to work full stack - we prioritize backend but value well-rounded engineers.
- General understanding of distributed systems and cloud-native principles (e.g. the twelve-factor app), including how services are deployed, scaled, and load-balanced in a containerized environment.
- Passion for building AI systems that people actually use.
Preferred Skills (optional):
- Experience building AI-powered platforms, assistants, or automation tools.
- Familiarity with agent patterns like ReAct (Reasoning and Acting), tool chaining, or multi-agent orchestration.
- Knowledge of RAG systems, vector databases, or prompt tuning.
- Prior experience in developer tools, internal platforms, or large-scale systems.
- Exposure to observability, debugging, incident workflows, or service reliability.