Job Description:
• Lead and grow a team of ML engineers focused on production ML systems
• Drive model improvements in response to production issues, product feedback, and new research or platform advancements
• Lead production release processes for ML services, including release planning, CI/CD, staged rollouts, and rollback procedures
• Build and operate observability and on-call practices for ML features, including monitoring, alerting, dashboards, incident response, and post-incident reviews
• Develop and maintain scalable evaluation frameworks, datasets, and automated regression tests to prevent quality regressions
• Lead reliability, performance, and cost improvements for inference and serving, including capacity planning and meeting SLAs (latency, throughput, availability)
• Partner with researchers, product, and platform teams to define quality bars and production readiness, including Trusted AI requirements
• Establish and evolve production standards and governance across ML features (testing, evaluation methodology, release gates, model versioning and lineage)
• Partner with platform and product teams to integrate ML capabilities into products
Requirements:
• BS or MS in Computer Science or Engineering, or equivalent experience
• Experience building and operating software systems, including production ML systems
• People leadership experience, or strong technical leadership experience (mentoring, setting direction, driving delivery)
• Experience with cloud infrastructure (AWS, Azure, or GCP) and production observability
• Experience with CI/CD, reproducible deployments, and operating services in production
• Strong written communication and documentation skills
Benefits:
• Health and financial benefits
• Time away and everyday wellness