Machine learning is a process driven by iteration and experimentation, which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models, there has been a recent trend toward building a dedicated feature store. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.
Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host is Tobias Macey and today I’m interviewing Simba Khadder about his views on the importance of ML feature stores, and his experience implementing one at StreamSQL
- How did you get involved in the areas of machine learning and data management?
- What is StreamSQL and what motivated you to start the business?
- Can you describe what a machine learning feature is?
- What is the difference between generating features for training a model and generating features for serving?
- How is feature management typically handled today?
- What is a feature store and how is it different from the status quo?
- What is the overall lifecycle of identifying useful features, defining and generating them, using them for training, and then serving them in production?
- How does the usage of a feature store impact the workflow of ML engineers/data scientists and data engineers?
- What are the general requirements of a feature store?
- What additional capabilities or tangential services are necessary for providing a pleasant UX for a feature store?
- How are discovery and documentation of features handled?
- What is the current landscape of feature stores and how does StreamSQL compare?
- How is the StreamSQL feature store implemented?
- How is the supporting infrastructure architected and how has it evolved since you first began working on it?
- Why is streaming data such a focal point of feature stores?
- How do you generate features for training?
- How do you approach monitoring of features and what does remediation look like for a feature that is no longer valid?
- How do you handle versioning and deploying features?
- What’s the process for integrating data sources into StreamSQL for processing into features?
- How are the features materialized?
- What are the most challenging or complex aspects of working on or with a feature store?
- When is StreamSQL the wrong choice for a feature store?
- What are the most interesting, challenging, or unexpected lessons that you have learned in the process of building StreamSQL?
- What do you have planned for the future of the product?
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email firstname.lastname@example.org with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Feature Stores for ML
- Distributed Systems
- Google Cloud Datastore
- Uber Michelangelo
- AirBnB Zipline
- Lyft Dryft
- Apache Flink
- Apache Kafka
- Spark Streaming
- Apache Cassandra
- Apache Pulsar
- TDD == Test Driven Development
- Lyft presentation – Bootstrapping Flink
- Go-Jek Feast