Software Engineer, Deep Learning Infrastructure - Stanford, United States - Tesla


    **Software Engineer, Deep Learning Infrastructure - Autopilot**

    Engineering & Information Technology · Palo Alto, California · ID 104044 · Full-time

    **THE ROLE:**

    As a Software Engineer within Autopilot, you will work on reinforcing, optimizing, and scaling our neural network training infrastructure.

    At the core of our self-driving capabilities are neural networks that the Deep Learning team designs and trains on large amounts of data. Robustly running training jobs at scale, whether for production models or quick experiments, and completing them in the shortest time possible is critical to our mission.

    **Responsibilities:**

    Write robust Python code in our machine learning training repository, applying software best practices, to support machine learning scientists in tasks such as fetching training data, preprocessing it, and orchestrating training runs.

    Integrate the training software into our continuous integration cluster to support metrics persistence across experiments, weekly/nightly neural network builds, and other unit and throughput tests.

    Profile the performance of training software in our training cluster, identify bottlenecks in and between CPU and GPU code execution, and optimize its throughput and scalability within and across nodes to ultimately reduce convergence time.

    Coordinate with the team managing the hardware cluster to maintain high availability and job throughput for machine learning.

    **Requirements:**

    Experience programming in Python and/or C/C++.

    Proficiency in system-level software, in particular hardware-software interactions and resource utilization.

    Understanding of modern machine learning concepts and state-of-the-art deep learning.

    Experience working with training frameworks, ideally PyTorch.

    Demonstrated experience scaling neural network training jobs across clusters of GPUs.

    Optional: Experience programming in CUDA.

    Optional: Experience profiling and optimizing CPU-GPU interactions (pipelining compute and transfers, etc.).

    Optional: DevOps experience, in particular dealing with clusters of training nodes and filesystems for very large amounts of training data.
