Bloomberg's Data Science Platform was established to support development efforts around data-driven science, machine learning, and business analytics on Bloomberg's many datasets.
The platform provides a standard set of tooling for the Model Development Life Cycle (MDLC), spanning the early stages of development and data exploration, through experimentation and large-scale training, all the way to live inference. Through access to scalable compute and specialized hardware, Data Science Platform users can run ML training jobs and inference services, perform analytics and ETL using Spark, and explore data with Jupyter. The platform is built on Kubernetes, leveraging containerization, container orchestration, and a cloud architecture built on 100% open source foundations.
What we do:
Model prediction, or inference, is the last critical step in the MDLC, when the business value of model-driven applications can be realized. Our inference solution is powered by the open source project KServe, a highly scalable, standards-based model inference platform for trusted AI.
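To give a flavor of the standards-based interface KServe provides, a minimal InferenceService manifest looks roughly like the sketch below (the service name is a placeholder and the storage URI follows the layout of KServe's public examples; this is an illustration, not our production configuration):

```yaml
# Minimal KServe InferenceService sketch.
# Names and storageUri are placeholders based on KServe's public examples.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris          # placeholder service name
spec:
  predictor:
    minReplicas: 0            # allow Knative to scale to zero when idle
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Once a manifest like this is applied, KServe provisions the serving infrastructure on Kubernetes (backed by Knative and Istio in serverless mode), exposes standard inference protocol endpoints, and autoscales the predictor with request load.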
Delivering performance to latency-sensitive, throughput-heavy, model-driven applications means making the right choices from hardware to ingress. As a member of the Data Science Platform's Infrastructure team with a focus on the serverless components, you'll have the opportunity to work on the open source serverless technologies underlying KServe, such as Knative and Istio, as well as evaluate the latest hardware available on the market to serve hundreds to thousands of models in a scalable way.
As founding members of KServe, we regularly upstream features we develop, present at conferences, collaborate with our peers in the industry, and stay in tune with the surrounding Kubernetes community. Open source is at the heart of our team. It's not just something we do in our free time; it's how we work.
We'll trust you to:
- Innovate and design solutions that meet strict production SLAs: low latency/high throughput, multi-tenancy, high availability, reliability across clusters/data centers, etc.
- Interact with ML experts to understand workflows, pinpoint and resolve inefficiencies, and inform the next set of features for the platform.
- Collaborate with open source communities and internal platform teams to build a cohesive model deployment experience.
- Automate operations and improve observability of the platform by integrating with systems for metrics and distributed tracing.
- Troubleshoot and optimize ML model inference performance.
- Build tools that enable other engineers to debug and understand the performance of complicated systems.
What we are looking for:
- A passion for providing reliable and scalable ML infrastructure
- Experience designing and implementing low-latency, high-scalability systems
- Experience working in a multi-tenancy and multi-cluster environment
- Experience with ML infrastructure open source projects such as Kubeflow, KServe, MLflow, or Feast
- Experience with distributed systems, e.g. Kubernetes, Kafka, ZooKeeper/etcd, Spark
- Experience debugging performance issues with distributed tracing and benchmarking tools
- Proficiency in two or more languages (Go, Python, C++, or JavaScript) and willingness to learn new technologies as needed
- At least 2 years of experience as a software engineer
We'd love to see:
- Experience with serverless frameworks or infrastructure, such as Knative, AWS Lambda, or Google Cloud Run
- Experience working with service meshes and authentication & authorization systems like SPIRE/SPIFFE
- Experience working with GPU compute software and hardware
- Ability to identify and perform OS and hardware-level optimizations
- Open source involvement, such as a well-curated blog, accepted contributions, or a community presence
- Experience with cloud providers such as AWS, GCP or Azure
- Experience with configuration management systems (Chef, Puppet, Ansible, or Salt)
- Experience with continuous integration tools and technologies (Jenkins, Git, ChatOps)
- A passion for education, e.g. providing workshops for tenants
Learn more about our work:
- Machine Learning the Kubernetes Way - https://www.youtube.com/watch?v=ncED2EMcxZ8
- ML at Bloomberg - https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9810-machine+learning+%40+bloomberg%3a+building+on+kubernetes
- Inference with KServe (formerly KFServing) - https://www.youtube.com/watch?v=saMkA4fIOH8
- Exploring model serving with KServe - https://www.youtube.com/watch?v=FX6naJLaq2Y
- The journey to build Bloomberg's ML Inference platform - https://www.bloomberg.com/company/stories/the-journey-to-build-bloombergs-ml-inference-platform-using-kserve-formerly-kfserving/
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.