Opportunity Description
About The Role
We’re looking for a deeply technical, hands-on software engineer to join our on-field Kernel Reliability team. You'll help tackle a critical challenge: improving the reliability of our advanced compute clusters and the underlying inference, training, and internal production services. In this role, you'll work close to the code and design solutions that will scale with our rapidly growing system production and software service offerings. If you have strong fundamentals in systems, debugging, and failure analysis—and enjoy building tools and solving hard reliability problems—we want to hear from you. New college graduates are welcome.
Responsibilities
- Contribute to the technical roadmap and execution for kernel-centric reliability of our internal and customer-facing systems.
- Partner with System and Cluster Operations teams to reduce system and service downtime after failure through tooling, analysis, and hands-on debugging ...