Accelerate Your Machine Learning With The StreamSQL Feature Store - Episode 137

Summary

Machine learning is a process driven by iteration and experimentation which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models there has been a recent trend toward building a dedicated feature store. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.

Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to dataengineeringpodcast.com/linode today you’ll even get a $60 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host is Tobias Macey and today I’m interviewing Simba Khadder about his views on the importance of ML feature stores, and his experience implementing one at StreamSQL

Interview

  • Introduction
  • How did you get involved in the areas of machine learning and data management?
  • What is StreamSQL and what motivated you to start the business?
  • Can you describe what a machine learning feature is?
  • What is the difference between generating features for training a model and generating features for serving?
  • How is feature management typically handled today?
  • What is a feature store and how is it different from the status quo?
  • What is the overall lifecycle of identifying useful features, defining and generating them, using them for training, and then serving them in production?
  • How does the usage of a feature store impact the workflow of ML engineers/data scientists and data engineers?
  • What are the general requirements of a feature store?
  • What additional capabilities or tangential services are necessary for providing a pleasant UX for a feature store?
    • How is discovery and documentation of features handled?
  • What is the current landscape of feature stores and how does StreamSQL compare?
  • How is the StreamSQL feature store implemented?
    • How is the supporting infrastructure architected and how has it evolved since you first began working on it?
  • Why is streaming data such a focal point of feature stores?
  • How do you generate features for training?
  • How do you approach monitoring of features and what does remediation look like for a feature that is no longer valid?
  • How do you handle versioning and deploying features?
  • What’s the process for integrating data sources into StreamSQL for processing into features?
  • How are the features materialized?
  • What are the most challenging or complex aspects of working on or with a feature store?
  • When is StreamSQL the wrong choice for a feature store?
  • What are the most interesting, challenging, or unexpected lessons that you have learned in the process of building StreamSQL?
  • What do you have planned for the future of the product?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:11
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What advice do you wish you'd received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly Media on a project to collect the 97 things that every data engineer should know and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3-compatible object storage, and worldwide data centers you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Simba Khadder about his views on the importance of ML feature stores and his experience implementing one at StreamSQL. So, Simba, can you start by introducing yourself?
Simba Khadder
0:01:31
Yeah. Hey, I'm Simba Khadder. I'm the CEO and co-founder of StreamSQL. We're building a feature store for machine learning.
Tobias Macey
0:01:38
As you said. And do you remember how you first got involved in the area of machine learning and data management?
Simba Khadder
0:01:42
Yeah, so I actually started out working at a small startup back in college, and I used to do a lot of hackathons and had always really been interested in distributed systems. So my first real role was actually at Google, where I was working on Cloud Datastore. I loved distributed systems just because of how messy they are as a field; there's never really a right answer, everything is always a trade-off. On the machine learning side, I actually kind of fell into it because one of my friends asked me for help. He was on a team of astrophysicists, and they were trying to find a planet on the outskirts of the solar system. They were like, hey, we have all this image data, we need someone to crunch through it, do you have some time to work on that? And I just kind of picked it up and learned it on the fly, and a lot of the same things that made me love distributed systems, how there aren't really right answers, all the messiness, the creativity that goes into it, also exist for machine learning. So I kind of fell into it that way and fell in love with it for the same reasons.
Tobias Macey
0:02:45
And so, in terms of the work that you're doing at StreamSQL, can you give a bit of a description about what you've built there and what motivated you to start the business?
Simba Khadder
0:02:54
Yeah, so StreamSQL is an open source feature store that enables teams to share, reuse, and discover features across teams and models. Generally, how it works is like this: first you connect your data sources, whether streaming or batch, Kafka, S3, whatever. You transform and join them as you wish using SQL, and then you define your machine learning features on top of those sources. From there, you can serve those features in production, and you can generate training sets out of them. We also manage all the versioning and monitoring of the features, and you can even include third-party features, like embeddings or whatever other data. In a nutshell, the mission is to help machine learning teams focus on building models rather than ML pipelines. As for how I got into it, I founded a startup before this called Triton, and we worked with a lot of media companies. We did all kinds of stuff, but we really focused on the B2C subscription space, so paywalls and things like that. We did a lot of work around personalization, propensity, paywalls, all this stuff that was powered by machine learning. At our peak we were handling about 100 million monthly active users. I was looking at the data science teams and seeing how they were doing things, and I realized that most of our time was actually spent building Flink and Spark jobs to build out datasets and get features into production, and the whole process was chaotic and hard to manage. But the thing was that a lot of the feature engineering they were doing was actually what was driving the biggest improvements, so it made sense. At the time, the term feature store wasn't really a thing; if you googled feature store, nothing would really come up. I googled around and some other companies had talked about similar problems: Uber had Michelangelo, Airbnb had something called Zipline, Lyft had something called Dryft. We were looking for something open source that was good enough for our use case, couldn't find anything, and decided to build it in house. At one point I realized this was really something special that would benefit our current clients and all kinds of other people, so I decided to take just that piece of the product and roll it out into its own company, which is now StreamSQL.
Tobias Macey
0:05:11
And before we get too far into what you've built there, can you give a bit of a background about what a machine learning feature is, and some of the difficulties that people have in working with them in the typical deployment paradigm and typical workflow for building machine learning models and serving them in production?
Simba Khadder
0:05:30
So a machine learning feature, it's kind of funny because it's a nascent thing; there isn't really a super clear definition for it. But the way I think about it is that a machine learning model is essentially a function: it takes an input or inputs and generates an output. So let's say I'm Spotify and I want to recommend some songs for you. The input would probably just be me, the user, and the output should be the recommendations. Now if I just give it a user ID, the model doesn't really have much to work with. So usually what teams do is take that user ID and break it down into who the user is, the user's context, per se. You might say, hey, what were the last couple of songs this user listened to, what's their favorite genre, and all of that gives the model context it can use to make a better recommendation. Each of those things is a feature. So their favorite genre could be made into a feature for the model. It sounds kind of easy, but it's actually a lot harder than that. For example, let's say I want to tell the model how diverse a user's music taste is. Well, there isn't exactly an equation that defines music taste, so you need to figure out a way to creatively model that to give it to the model itself. That's what feature engineering is. In terms of what makes features hard: you really have to generate a feature in two contexts. There's the serving context. For example, I want to recommend a song for a user right now, so I need to know the values of all the features at that moment. In many situations you have streaming data, and especially if you're Spotify, you need that recommendation at relatively low latency, so you need all these features, at more or less their current value, very quickly. So usually things are pre-processed: you use something like Flink to process all the streaming data coming in and maintain the values of the features, so that you can pull them with a lookup rather than trying to generate them at recommendation time. That's the serving part, and that's hard enough. But there's also the training part. Models are trained, essentially, by giving the model the inputs it would have used and the actual output, what actually happened. So, here's all this information about a user at a specific point in time, and here's what they actually chose to listen to next. You give the model the inputs, make it produce a recommendation, then you give it the actual value, and it changes itself accordingly to try to be better next time. That's how training happens in a nutshell. Now, that means that on the feature generation side, I not only need to be able to generate the features now, I need to be able to generate the features at any point in time in the past. That's a whole other problem, and usually there's a whole other pipeline there. So you end up with all these ML pipelines: some are streaming, some are batch, some are for serving, some are for training.
And it's all kind of broken up across all these different layers in your infrastructure. And that's kind of the problem that we strive to solve.
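To make the serving-versus-training distinction concrete, here is a minimal sketch in plain Python. It is not StreamSQL's API; the event list, the favorite_genre helper, and the label format are all hypothetical. The point is that serving needs the feature's value right now, while each training row needs the value as it looked at that label's timestamp, otherwise future information leaks into the training set.

```python
from datetime import datetime
from collections import Counter

# Hypothetical listening events: (user_id, genre, timestamp)
events = [
    ("u1", "jazz", datetime(2020, 3, 1)),
    ("u1", "rock", datetime(2020, 3, 5)),
    ("u1", "jazz", datetime(2020, 3, 9)),
]

def favorite_genre(user_id, as_of):
    """Feature value computed only from events observed before `as_of`."""
    history = [g for (u, g, t) in events if u == user_id and t < as_of]
    return Counter(history).most_common(1)[0][0] if history else None

# Serving: the model needs the feature's current value, quickly.
print(favorite_genre("u1", datetime.now()))

# Training: each label needs the feature as of the label's timestamp.
labels = [("u1", datetime(2020, 3, 6), "listened_to_rock")]
training_rows = [
    {"favorite_genre": favorite_genre(user, ts), "label": outcome}
    for (user, ts, outcome) in labels
]
print(training_rows)
```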
Tobias Macey
0:08:45
And so, as far as the current approach that most companies use, I know that it's still a burgeoning field where there are more people coming into it and more people building their own or off-the-shelf solutions. What is the typical approach to handling features and being able to provide them both for training and online contexts?
Simba Khadder
0:09:06
Yeah, so in online contexts you usually see one of two things. Either the company has the data already or they pre-process it: it goes through Kafka or whatever for the event bus, then it's processed by something like Flink or Spark Streaming, and the actual values of the features are kept in maybe Cassandra or Redis so that you can pull them very quickly. That's the streaming side. Then you usually have a whole other side: all the events that come in will also go into S3 or HDFS or some sort of file store, and you'll have a whole other pipeline on Spark or some other batch processing system that generates all the training data. So that's usually what people do today. Every single feature, at the very least, gets written twice, once for streaming and once for batch. And then if you have a mix of streaming and batch sources, you kind of have to double that up again, and you end up with four different pipelines. So that's typically what people do today; it all exists at the Flink and Spark, or whatever else you use, layer.
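A rough sketch of the duplication being described, with made-up names rather than real Flink or Spark code: the same purchase-count logic ends up living both in a streaming job that keeps the online store current and in a batch job that recomputes history for training.

```python
from collections import defaultdict

# --- Streaming side (what a Flink / Spark Streaming job would maintain) ---
online_store = defaultdict(int)   # stand-in for Cassandra or Redis

def on_purchase_event(event):
    """Incrementally update the serving value as each event arrives."""
    online_store[event["user_id"]] += 1

# --- Batch side (what a Spark job over S3/HDFS files would recompute) ---
def purchase_count_as_of(history, user_id, as_of):
    """Recompute the same logic over stored events to build training data."""
    return sum(1 for e in history if e["user_id"] == user_id and e["ts"] <= as_of)

# The business logic is identical, but it now lives in two pipelines,
# and four once you mix streaming and batch sources.
on_purchase_event({"user_id": "u1"})
print(online_store["u1"])
```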
Tobias Macey
0:10:12
And then, in terms of a feature store, how does that improve the lives of people working on machine learning models? Who are the main downstream consumers of it, and who's responsible for building it and keeping it healthy?
Simba Khadder
0:10:26
Yeah. So in doing this, I've learned how many different teams have to go into the ML process. You need a data engineering team; depending on how big your company is, you might need an IT team to stand the infrastructure up; you need someone to actually generate your features, which could be the data scientists themselves, data engineers, or a mix of both; and then you have data analysts and other teams that also might add features or have things to do with the model. So the way we think about it is, rather than having everything at such a low level, we allow people to first define their sources, and you can run SQL on those sources regardless of whether they're streaming or batch, so you don't have to go down a layer to make queries. We try to unify it with a relatively generic layer to materialize views. From there, you can define features in essentially JSON; it's just config. And that means you have one feature definition that's used across all contexts, whether it's training or serving or anything else. From that feature, you can generate your training datasets: you give it a set of features, you give it a set of labels, so the right answers, and it will generate a training set for you. You can also ask, hey, what are the values of these features right now, and we'll do that in real time. So the idea is that it lets things exist in one place rather than being spread across all your infrastructure, and it's defined in a very clear way. Beyond that, we have a feature registry so you can actually look at your features: you can see basic statistical analysis of each of your features, you can search for features that other teams have used, you can pull a feature from another model. We think of features as the building blocks of your models, and we built the whole platform to let you actually use them in that way, rather than thinking of them that way but then having to define them at the Spark or Flink level.
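As an illustration of "features defined in essentially JSON", here is the rough shape such a definition might take. The field names are invented for this sketch, not StreamSQL's actual schema; the idea is that one declarative definition drives both training and serving.

```python
# Illustrative only; not StreamSQL's real configuration schema.
feature_definition = {
    "name": "avg_purchase_price",
    "version": 2,
    "source": "purchases",                  # a SQL-defined materialized view
    "entity": "user_id",
    "aggregation": {"type": "mean", "column": "price"},
    "window": "90d",
    "missing_value": "median",              # what to use when a user has no history
    "normalize": {"min": 0, "max": 1},
}

# One definition, two contexts:
#   training -> values computed as of each label's timestamp
#   serving  -> current value looked up from the online store
```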
Tobias Macey
0:12:21
And with using a feature store, it seems that that provides a sort of concrete interface for data engineers to hand off things to the machine learning engineers rather than having sort of a blurry line between where the responsibilities of one ends and the other begins and having to reach inside each other's workflows to ensure that the overall development and delivery of a machine learning model is able to make it all the way through to production. And I'm wondering what your experience has been in terms of how it modifies your own work doing machine learning workflows, and how it impacts the data engineers who are supplying the raw data that are being turned into these features.
Simba Khadder
0:12:59
Yeah. So one really cool part, I think, from the data engineer side, is that a lot of machine learning stuff makes no sense outside of machine learning. For example, let's say I want to know the average price a user has spent on an item. Well, what do I give the model if the user has never bought anything? In a database you would just say null, but for machine learning you might set it to the median, or the mean, or something else. So someone on the data engineering side has had this really arcane set of requirements, like, hey, this is how the feature has to work, I need it generated for training and serving, and then they have to go and implement that. So this is nice, because it lets them not have to do that. They can just generate all the tables that are generically needed, like a purchases table or whatever else, and the data science team, on the other hand, can just plug all that stuff in. Now they have their nice, clean datasets and they can define their features more as configuration, so there's much less code and it's much more oriented towards their workflow. All the generic feature engineering techniques are built in, since everyone does the same things: fill in missing values, normalize numbers from 0 to 1 or negative 1 to 1 or whatever, remove outliers. All that stuff is built in, so they can think, again, at the configuration level, and they can quickly add things without having to know Flink, without having to know Spark, and without having to work on that side of the codebase. The teams can work agnostically of each other, which is really nice for both of them.
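The built-in transformations mentioned here (fill in missing values, normalize, remove outliers) are easy to picture as small composable functions. A standalone sketch, not StreamSQL code:

```python
import statistics

def fill_missing(values, strategy="median"):
    """Replace None with a summary statistic so the model still gets a number."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present) if strategy == "median" else statistics.mean(present)
    return [fill if v is None else v for v in values]

def clip_outliers(values, n_std=3.0):
    """Clamp values more than n_std standard deviations from the mean."""
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    low, high = mu - n_std * sigma, mu + n_std * sigma
    return [min(max(v, low), high) for v in values]

def minmax_normalize(values, lo=0.0, hi=1.0):
    """Scale values into the [lo, hi] range."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

prices = [4.0, None, 7.5, 120.0, 5.0]
print(minmax_normalize(clip_outliers(fill_missing(prices))))
```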
Tobias Macey
0:14:24
And as far as the user experience of the data scientists and machine learning engineers, you mentioned that there's a registry for being able to view the different features that have been defined, but I'm wondering what types of additional user experience improvements or capabilities are necessary for a feature store to be effective and to maintain its overall health and utility within the system?
Simba Khadder
0:14:53
Yeah,
0:14:54
so one basic piece is versioning, which sounds like it should be a solved problem. But the way most people do feature versioning now is just through Git: as soon as you change the Spark or whatever code, it gets versioned in Git and you can roll back that way. That can be really messy, especially if you want to pin a single version of a feature, or if you're using a feature from another team, where you might want to make sure you have the same version all the time so that if they change the pipeline it doesn't break your model. So versioning is a big piece; it's something that hasn't really been figured out for ML features, and it's a core component of our feature store. Another piece has to do with monitoring. Features are obviously based on underlying data, underlying data changes, and user behavior might change. For example, right now a lot of people are sheltering in place, so a lot of buying behavior has changed dramatically, and lots of models are probably underperforming because they were trained in a different context. Monitoring allows you to see changes happening, understand what's happening, and be able to retrain or change your feature accordingly to handle those problems. The final piece has to do with training set generation. Our training set generation is implicit, which means you just tell us what actually happened, as a stream or a set, maybe with a timestamp, and we will generate the features at each timestamp and match them up with the label. So building training sets, which is a core piece of the workflow and the iteration cycle, is as easy as defining the feature in JSON and telling the training set generator, hey, I want to add this feature to this training set, and it just happens. You don't really have to think about where the data is, how to transform it, all the serving parts of it; it just becomes way, way faster to iterate. We also backfill streaming data, which is another really nice feature: even if your features are stateful, like average price spent per user, you don't have to wait three months to build up enough data. It will use historical data to generate the stateful feature. All of that together speeds up the iteration cycle for ML teams, especially on the data science side, which is, again, one of the core mission statements of StreamSQL.
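A minimal sketch of what "implicit" training set generation means: you hand over labels with timestamps, and the store fills in the point-in-time value of every requested feature. The compute_feature callable is assumed to exist (see the earlier point-in-time sketch); everything else here is illustrative.

```python
from datetime import datetime

# Labels are (entity, timestamp, outcome) triples; feature values are looked
# up as of each timestamp, which is also what backfilling from historical
# events makes possible for stateful features.
labels = [
    ("u1", datetime(2020, 3, 6), 1),
    ("u2", datetime(2020, 3, 7), 0),
]
feature_names = ["favorite_genre", "avg_purchase_price"]

def generate_training_set(feature_names, labels, compute_feature):
    rows = []
    for entity, ts, outcome in labels:
        row = {name: compute_feature(name, entity, as_of=ts) for name in feature_names}
        row["label"] = outcome
        rows.append(row)
    return rows

def fake_compute_feature(name, entity, as_of):
    """Stand-in for a real point-in-time feature lookup."""
    return 0.0

print(generate_training_set(feature_names, labels, fake_compute_feature))
```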
Tobias Macey
0:17:10
And in the discovery piece, what are some of the useful pieces of metadata that should be mapped to a given feature? And what are the options for the people defining the features to specify that metadata, in terms of its structure and content?
Simba Khadder
0:17:26
Yeah, so one part is just literally the data type. For example, neural networks don't take strings, they only take floats or numbers, so you can filter on that; there's a piece of, does this thing even fit into my model? Another piece is the description, which is just plain text. This is nice because, how many companies exist where there are a million definitions of essentially the same thing, like how many items a user bought? Different databases will have different values across an org, and all this stuff is really messy. There are obviously a lot of tools, like Looker and other BI tools, trying to fix that problem, and we're another part of the solution. If you build a feature around, say, how many items a user bought, other people can use that feature as-is, or add a new version to it, and that creates a kind of source of truth for your features, which is really helpful for the ML workflow, especially when you have multiple teams. Machine learning especially is such a new field that a lot of the processes are still being figured out, so getting 50 people to work together on one problem is very chaotic. This becomes a place where everyone can work together, collaborate, and benefit from each other's work in a much easier way.
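A hedged sketch of what a registry entry and a naive discovery search might look like; the field names are illustrative, not StreamSQL's metadata model.

```python
registry = [
    {
        "name": "items_purchased_30d",
        "version": 3,
        "dtype": "float",            # most models want numbers, not strings
        "entity": "user_id",
        "description": "Count of items a user bought in the trailing 30 days.",
        "owner": "growth-ml-team",
        "models_using": ["churn_v4", "recs_v12"],
    },
]

def search(registry, text):
    """Naive discovery: match against names and descriptions."""
    text = text.lower()
    return [e for e in registry
            if text in e["name"].lower() or text in e["description"].lower()]

print(search(registry, "bought"))
```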
Tobias Macey
0:18:44
And digging more into StreamSQL itself, can you talk through how the feature store aspect of it is implemented, and the workflow that it provides in terms of being able to define features and then pull them into the models that you're trying to build?
Simba Khadder
0:18:57
Yeah. So first, around deployment, you can go to streamsql.io, and there's a cloud-hosted version with a free tier; you just choose where your cloud is, so if you're on AWS or Google Cloud, you tell us and we'll host it there. We also have an open source version that you can use. It has slightly fewer features, but it's still what I use for my own local machine learning work. And finally, for a lot of our clients we actually deploy it in their cloud directly, or on-prem, or whatever else. So that's the deployment aspect of it. The way it works is three steps. First, you plug in your data. Maybe you're using Google Pub/Sub, so you just say, hey, this is a stream of data on Google Pub/Sub and the format is JSON. Once you plug all the data in, you can choose to join and transform the data you're getting, and then define your features. Our main API is in Python, and you define a feature in what looks like JSON, and that defines your features for you; then you can use them for training sets and for serving. So you have two main methods. One is generate training set: you give it a set of features and a label, which again could be a stream or a file that has the correct answers to use for training, and we'll generate the features at the point in time of each of those labels to produce the training set. For online features, you just call streamsql dot, you know, get online features, with a set of features and entities, like maybe the user ID, and we'll generate that. So it's as easy as: connect your sources, define your features, start using them for serving online data and for generating training datasets.
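To pull the three steps together without guessing at StreamSQL's real method names, here is a toy in-memory stand-in. register_source, register_feature, and get_online_features are invented for this sketch; the actual API is in Python, but its surface is not spelled out in the conversation.

```python
class ToyFeatureStore:
    """In-memory illustration of: connect sources, define features, serve values."""
    def __init__(self):
        self.sources, self.features, self.online = {}, {}, {}

    def register_source(self, name, **config):
        self.sources[name] = config                       # e.g. Pub/Sub topic, JSON format

    def register_feature(self, definition):
        self.features[definition["name"]] = definition    # JSON-like config

    def ingest(self, feature, entity, value):
        self.online[(feature, entity)] = value            # updated as events stream in

    def get_online_features(self, features, entity):
        return {f: self.online.get((f, entity)) for f in features}

store = ToyFeatureStore()
store.register_source("listens", kind="pubsub", format="json")
store.register_feature({"name": "favorite_genre", "source": "listens",
                        "entity": "user_id", "aggregation": "mode"})
store.ingest("favorite_genre", "u1", "jazz")
print(store.get_online_features(["favorite_genre"], "u1"))
```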
Tobias Macey
0:20:38
And how is the underlying architecture of the feature store implemented, as far as being able to pull in the data and integrate it, and then being able to create and store the features so they can be served up?
Simba Khadder
0:20:51
Yeah. So the online feature store today is built on top of Apache Pulsar, Flink, and Cassandra. Pulsar is the first layer; it's where events come in and where that state kind of lives. All the events go into Pulsar and we retain them forever, which lets us regenerate stateful features. We also support S3, if you just want to upload a straight-up file, or GCS, or HDFS, whatever plugs into Flink. We use Flink to do the SQL transformations into materialized views and to generate the training datasets. The online features are constantly processed as events come in, and the values are stored in either Cassandra or Redis, depending on the feature and the size of the feature set. So that's how it works today. It used to be on Kafka, and it was actually a lot messier, just because we couldn't retain all of our data. The cool part about Pulsar is that it has infinite retention, so every event that comes in we can keep forever, and we can actually offload it to S3 to lower costs and increase scalability. Yeah, so that's the underlying architecture today.
Tobias Macey
0:21:58
And you mentioned that you started on Kafka, and I know that you've got a fairly detailed blog post about your motivations and the process of making the migration to Pulsar. I'm wondering, what are some of the other ways that the feature store has evolved since you first began working on it, and some of the original assumptions that you had going into it that have been invalidated or updated as you continue to build out the capabilities of the platform?
Simba Khadder
0:22:25
Yeah, I think a big piece of it, one thing we've learned, is how machine learning is changing over time. Before, most features were pretty simplistic; they were just summations of everything a user did, maybe normalized. Nowadays a lot of people are using embeddings, which are essentially vectors that hold a lot of information inside them, so you can turn a user and all of their behavior into a single vector. You can do the same for items, and it's very common to do it for text: Google has released all kinds of pre-trained text embeddings that people use all the time, and BERT is kind of the new hot algorithm that also generates text embeddings from your text. We had to learn how to make the system flexible enough that you could include all of these external, third-party features and still obviously do all the simple, basic features, and that required a lot of changes. The other piece that was interesting: one of the biggest problems for us at first was around streaming and batch data, because they forced us to create multiple pipelines. Again, all your past events are stored maybe in GCS or S3, and you'd use Spark or something similar to generate a training dataset, and that's one pipeline; then you'd have Flink or whatever else taking in all the streaming data and generating your online features. And there's a whole juggling act you have to do there every time you want to create a stateful feature or put a feature into production or whatever else. That was the core problem we were solving for ourselves when we built StreamSQL, and it's obviously a big problem. But we also learned that feature versioning, feature sharing, all this other stuff, was kind of second order; they were just things that became obvious we could now do because of how we decided to implement it. And we realized that those value props are actually what made this much more interesting to me, and made me decide to spin it off into its own business, because it becomes a new way to iterate, a new way to do data science and machine learning, and a new way to think about features, generalizing them as fundamental building blocks of your models rather than just inputs you use to generate outputs. That forced us to change the whole model, made us think about what a feature registry would look like, and even let us ask, hey, let's say there was no streaming data and we're just working on files and batch data, how can we make this thing valuable? So I think that kind of learning, and thinking about what a feature store really means beyond just abstracting away the underlying architecture, is where things got really cool and where we had to make changes later to fit the vision we have.
Tobias Macey
0:25:04
And as far as the capabilities of the platform, I know one of the things that it supports is being able to monitor features or freeze them in time. I'm wondering, what are some of the influences that can cause a feature to drift? What are some of the signals or metrics that you look at in your monitoring, and how do you define the thresholds for determining when a certain action or remediation needs to happen?
Simba Khadder
0:25:30
Yeah, so different types of features have different levels of sensitivity. If you take the average price users have spent, using everything anyone has ever bought, and you have a ton of data, it might be very hard to change that value. If you're just taking the top song of the last week or whatever, that's going to change really quickly. So one part is first looking at how often these features change, how often they changed historically. That helps because a model is trained at a certain point in time, so once a model is trained, it's used to the features looking the way they did: having the same sort of standard deviation, a similar mean, a similar statistical shape. What we're looking for is when something quickly changes a feature that typically doesn't change, because that means the model has probably never seen anything like that, and it could cause its predictions to become really bad. The easiest example I can think of, because it's very relevant now, is that all of a sudden most of the US goes into shelter in place. Every e-commerce recommender system will have to have changed dramatically; all of the input features will probably have changed pretty dramatically. The way people buy things has changed, the way people browse, their preference for whether they want to pick up in store or have it delivered, all of that changed very quickly, overnight. And if you look at the input features for all those models, they will have changed dramatically, and the model might have just started spitting out garbage. Models are flexible to an extent, but most models will break down if you change a feature too dramatically; they've just never seen it before. So that's what we're looking for: we keep track of the standard deviation, we keep track of averages, and if we see things shift really quickly in a way that's unusual for that feature, we flag it to the user, and they can tell us what they want to do depending on the feature. Freezing means saying, hey, for now, until we fix this, if we see something unusual just ignore it, or set it to the average or a null value or whatever. Or they can retrain, and they might want to actually change their features entirely; they might say, hey, for our recommender system we want to add a new feature, the last three weeks of data, so that it catches the effect of whatever is going on. So that's what causes feature drift and how people typically handle it: retraining, changing the feature, freezing it in place, telling it to ignore the change, or outlier detection so you can keep it within range.
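A minimal sketch of the statistical check being described: track a feature's historical mean and standard deviation, and flag a fresh window of serving-time values that shifts outside what the model saw in training. The three-sigma threshold and the window are illustrative assumptions, not StreamSQL defaults.

```python
import statistics

def drift_alert(training_values, recent_values, z_threshold=3.0):
    """Flag when the recent mean of a feature drifts far from its training-time history."""
    mu = statistics.mean(training_values)
    sigma = statistics.pstdev(training_values) or 1e-9
    z = abs(statistics.mean(recent_values) - mu) / sigma
    return z > z_threshold

# E-commerce example: average basket size shifts suddenly.
trained_on = [3.1, 2.9, 3.0, 3.2, 2.8, 3.0]
this_week = [6.5, 7.1, 6.8]
if drift_alert(trained_on, this_week):
    print("feature drifted: freeze it, retrain, or fall back to a default value")
```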
Tobias Macey
0:28:03
And on the versioning side of things, how do you approach being able to iterate on the different features and ensure that you don't accidentally introduce errors into an existing machine learning model that's actively using a given feature?
Simba Khadder
0:28:18
Yeah. So when you add a new version, you're typically going to train on it first. You usually would not put a new feature into production and start serving it unless you've trained a whole new model on it. There are exceptions; for example, if you add new outlier detection or whatever else, you might be doing that to solve something you're seeing in production. The power of versioning comes in two parts. One piece is that other teams will depend on that feature. If I'm a team and I have a model depending on a feature, I don't want another team to change it underneath me. I can depend on the current version, and when they increment it, I can see that and decide what I want to do based on what changes they made. The other piece is that sometimes a feature might look better in a training set: you train on all this data, the feature looks like it's working really well, you put it in production, you A/B test it against the old model, and you realize it's actually not doing better in production. This is part of what makes machine learning messy: even if a model looks better in training, offline, it might actually do worse online. So it's really common to A/B test your models online even if one should be better theoretically, and just being able to roll back is really powerful. There's also having a clear view of how the small details of a feature have changed over time, which can help people understand design decisions that were made, why things are normalized the way they are. It's just nice to have your feature versions all in one place, so you can see what's happened, what's changed, which small changes it has had, etc.
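The version-pinning behavior can be pictured as a registry that keeps every definition around, so a consumer asks for an exact version and upgrades deliberately. Illustrative structure only:

```python
# Every version of a feature definition is retained; nothing is mutated in place.
registry = {
    ("avg_purchase_price", 1): {"window": "90d", "missing_value": "mean"},
    ("avg_purchase_price", 2): {"window": "30d", "missing_value": "median"},
}

def get_definition(name, version=None):
    """Consumers pin an exact version; None means latest, at their own risk."""
    versions = [v for (n, v) in registry if n == name]
    chosen = version if version is not None else max(versions)
    return chosen, registry[(name, chosen)]

# A model trained against v1 keeps requesting v1 even after the owning team
# ships v2, so its inputs never change underneath it; rolling back after a
# failed A/B test is just pointing production back at the earlier version.
print(get_definition("avg_purchase_price", version=1))
```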
Tobias Macey
0:29:56
And as far as the integration process for working with data in StreamSQL, you mentioned that the ingest pipeline is built on top of Pulsar as far as being able to get data into a given source. What is the interface for being able to merge data across different topics as it's being ingested by Pulsar and then processed by Flink?
Simba Khadder
0:30:20
Yeah. So you can literally just write SQL as you would expect to. You set dependencies on two streams and just join them using whatever join you want, and that runs in Flink. We clean it up, we make it work the way you expect it to, and then we generate your features from that. The cool part is the ability to join across types. For example, I might have a CSV file of items, every item in my e-commerce store, and a stream of what users are doing, like they bought this item or whatever, and I can actually join the file with the stream, which is a really nice thing to have. It lets you, as a data scientist, not have to worry about the underlying capabilities of your tools or think about where these things live and how to join them together; you just write SQL as you would expect to. We handle the average use case; if you're doing something really, really specific, then maybe it makes sense to go down to the Flink level. But on average, most data scientists are just trying to join different streams, or just add certain extra data to a stream.
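The join being described, a stream enriched with a static file, has roughly this shape in SQL. The table and column names are invented, and the exact dialect accepted is not specified here; the point is only that one side is unbounded and the other is a file, and the query does not care.

```python
# Hypothetical names; a streaming source joined to a batch file with plain SQL.
ITEM_ENRICHED_PURCHASES = """
    SELECT p.user_id,
           p.item_id,
           p.price,
           i.category,        -- comes from the uploaded items file
           p.event_time
    FROM   purchases AS p     -- streaming source (e.g. a Pulsar topic)
    JOIN   items     AS i     -- batch source (e.g. a CSV on S3)
      ON   p.item_id = i.item_id
"""
# Features such as "average price per category per user" can then be defined
# on top of this materialized view instead of on the raw events.
```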
Tobias Macey
0:31:24
And then, as far as the materialization of the features, how are those stored, and how do you handle updates to them and keep track of the different versions in the materialized locations?
Simba Khadder
0:31:41
So there's actually a lot that goes into that. You can materialize two things, really. One is a stream, in which case we take your sources and generate a new stream in Pulsar; we give it a name and a version. The same goes for tables: we'll generate a table and keep it up to date in that way. So a materialization exists to us as if it were a native stream or a native table. The difference is that, from your point of view, you can't directly change it; you can only change its sources, and that feeds up through it. If you change a materialization, it's a question of creating a new materialization, etc. And if you have materializations that depend on other materializations, it's almost like an Airflow DAG: we just start at the beginning and update everything that needs to get updated, and once the stream is set up we can regenerate all the downstream streams from there. So there's a process there, but the good thing is that it's really simple, so it doesn't break often, whereas if you tried to be smarter about it there are so many gotchas involved. We keep it as simple as possible: every time we create a materialization, we just build it from scratch from the sources, and if something depends on it, we'll eventually rebuild that from scratch too. It's eventually consistent, so we'll let you know when the new version is ready, but until then it will just remain at the previous version. That's how we do it today; it's a process that we've built on top of Airflow to make sure all of that can happen.
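The rebuild behavior described, start at the sources and regenerate every downstream materialization from scratch, can be sketched as a simple walk over a dependency graph. Names and structure here are illustrative, not how StreamSQL's Airflow-based process is actually implemented.

```python
# Each materialization lists what it reads from.
deps = {
    "purchases_clean": ["purchases"],          # built from a raw source
    "user_spend":      ["purchases_clean"],
    "spend_features":  ["user_spend", "items"],
}

def downstream_of(changed, deps):
    """Everything that must be rebuilt, in dependency order, after `changed` updates."""
    order, seen = [], set()
    def visit(node):
        for materialization, parents in deps.items():
            if node in parents and materialization not in seen:
                seen.add(materialization)
                order.append(materialization)
                visit(materialization)
    visit(changed)
    return order

# Changing the raw `purchases` source triggers a from-scratch rebuild of the
# chain; readers stay on the previous version until each new one is ready.
print(downstream_of("purchases", deps))
```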
Tobias Macey
0:33:08
In terms of the selection process for determining which components to include in the overall infrastructure, what was your guiding principle for determining build versus buy, and what was the necessary set of capabilities for incorporating into your infrastructure?
Simba Khadder
0:33:25
Yeah, I mean, we try to build as little as possible at that layer. On our side, our value prop isn't about hitting certain levels of latency or certain levels of capability; most tools can handle that. Like Kafka and Pulsar, you can build a system on top of either, it's just, in my experience, harder to build it with Kafka than with Pulsar, for example. So our guiding principle is simplicity. I know when you look at the infrastructure it seems silly because of how many components are put together, but really, we do want to be as simple as possible while maintaining reliability, and everything we did was around that. In fact, remember, the feature store itself we built internally not because we wanted to but because we had to, originally, at the old startup where we built it. So I've always been a proponent of: figure out what you do, build that, and use whatever you can get off the shelf to get your requirements met. It's like a TDD kind of thinking: hey, these are my requirements, how do I get there as fast and easily as possible? The other piece is simplicity, because the simpler something is, you know, it's like KISS, keep it simple. So for all of our infrastructure choices, when we decide to switch things out, we just look at what life would be like if we had this other piece of infrastructure, and would it be simpler? Not would it be faster, or would it be able to handle more of whatever, unless we hit a point where we need that; usually we just aim for simplicity. Especially with the Apache tools, it takes a lot to get to a point where they're unable to handle what you're throwing at them, especially if you take the time to configure them. And if something is so complicated to configure, then that goes back to the simplicity problem again, and maybe it makes sense to switch it out.
Tobias Macey
0:35:11
And in terms of the overall landscape of feature stores, what have you used as reference material for determining how to go about implementing it? And what does the overall landscape look like, as far as the availability of feature stores for somebody to be able to pick up and use, and even just prior art that is not necessarily open source but at least has some sort of white paper or reference architecture available?
Simba Khadder
0:35:38
Yeah, so when we first looked at the problem, we actually landed on a talk that someone from Lyft gave, I think it's called Bootstrapping Flink, and that gave us an idea of what the problem was and how others were solving it. It also kind of validated, in our heads, that there wasn't something off the shelf we could use that fit exactly what we needed, and that we would have to build it if we wanted it. So yeah, Lyft has Dryft, which that Bootstrapping Flink talk covers. Airbnb has something called Zipline, and actually a lot of our decisions were influenced by how Airbnb did their feature store. One thing that makes their feature store really unique, and something that we also do, is that it can generate training datasets implicitly: you just give it a set of labels and it will generate the training set for you. Other feature stores don't usually do that; you can give them a timestamp and they'll tell you the features at that timestamp, but it's your job to generate the training set. So that was another piece that made Airbnb's Zipline really interesting. One of the first to talk about a feature store, to my knowledge, was Uber; they built something internally called Michelangelo that they've spoken about as well, and I think that's one of the earliest cases of a feature store, where someone would actually define it as a feature store, being spoken about publicly. So that exists in the proprietary domain; none of those are open sourced, and none of those are even publicly available, you can't really use them, you can just look at their talks and how they describe them. In terms of open source, Gojek has open sourced something called Feast, which you can check out; they handle a lot of the middleware, like defining features, but you can't currently use it to generate materialized views, and that sort of discovery, etc., is limited. So that's the open source side. There's also a company, Hopsworks, that has built a feature store and open sourced parts of it that you can check out. So it's definitely becoming a really hot space, and there are a lot of startups raising money in this space now as people start to realize that this should be a core piece of infrastructure.
Tobias Macey
0:37:47
And what have you found to be the most challenging or complex aspects of working on or with a feature store, as you build out the capabilities of StreamSQL and use it for your own work at Triton?
Simba Khadder
0:38:00
Yeah, I think the hardest thing is user experience. I would argue most big data tools have a problem where they have to balance letting you do everything you need to do while also being simple to use if you just want the most basic things, so that's always a constant tension: opening up more stuff, making it more tunable, but then making it much harder to use. So it's about getting that developer experience down, because again the goal is to let people iterate faster on machine learning; that's our guiding north star, and everything we do is thought about from that angle. That means that sometimes there might not be a way to do a feature if you need something very, very specific, like maybe super hyper-low-latency feature serving; maybe you don't use StreamSQL for it, because we're optimizing for the average use case, which is what the majority of people have: hey, I have this dataset, I need to generate this feature as fast as Flink can do it by default. That's kind of how we think about it. So the most challenging and unexpected parts have been how hard that is. It's not really unexpected, but every time you start designing an API or something like that, you feel like, oh, I think I can do this, I have a handle on it, and then you always find there are all these other requirements, all this stuff that gets added on. Designing APIs, and just basic things like how to name things, are really, really hard. It's one of those unsolved problems where every time you think you've got it, there's always so much more to learn and so much iteration to do.
Tobias Macey
0:39:45
And as far as your own experiences, what have been some of the most interesting or challenging or unexpected lessons that you've learned in the process of building stream sequel?
Simba Khadder
0:39:54
I think it just comes down to this: a lot of people think that tools, especially new tools, are pushed by hype, and that if you just create hype about something, people will use it. People point fingers at all these technologies that quote-unquote only exist because of hype, but I don't actually buy that. I think you really need to solve a problem for someone; someone needs to be able to use your tool and feel like, cool, I love this thing, I would never not use it, I would always use this thing. So it's about getting back to your fundamentals: what are you trying to do, are you doing it, and how can you do it best? You always feel like there are ways to shortcut that, but I truly believe that over time the best product will win eventually, and certain paradigms will eventually float up; eventually someone will get it right and it will just work. So I'm kind of optimistic that over time, the best tools will end up being the tools that most people use, and that we're moving forward constantly, not just moving around quickly.
Tobias Macey
0:41:08
And then, for people who are working on providing data to machine learning teams or working as a machine learning engineer or a data scientist, what are the cases where either using a feature store in general, or StreamSQL in particular, is the wrong choice?
Simba Khadder
0:41:25
Yeah. One piece, and I think this is true of almost every feature store right now, is that they don't really handle image, video, or audio data. Images are obviously a big one, because a lot of machine learning is image processing, so feature stores really aren't the place for that. Also, with certain types of problems like image processing, the model itself becomes really, really important, sometimes much more important than the features. So it depends on the problem space: if you feel that the model is actually the most important piece, the one that's going to drive the biggest performance gains, and your features are super simple and all you're doing is constantly iterating on the model, then a feature store probably isn't going to do that much for you. The other piece is, for example, there are certain models that have to be so low latency that people will actually put parts of the model, or the whole model, in the browser, and there are deployment tools where you're actually splitting a model across more than just different servers, like part of the model runs in the browser, part of it runs on the server, etc. When you have stuff like that, where you're hyper-optimizing for latency, especially latency for feature serving, then a feature store is probably not going to hit the latencies you need, and you'd be jumping through hoops to get it. It will be as fast as a lookup in Cassandra and as fast as Flink can process, but if you're at a point where this needs to be perfect, or it needs to be blazing fast, then yeah, it's not the right tool; you should continue building it custom. I don't think there's actually any off-the-shelf tool you could use in that situation. And as you continue to use and improve the StreamSQL platform, what do you have planned for the future of the product? Yeah, I mean, the core of StreamSQL is allowing feature engineering and feature generation to be a simple and unified process. So, like I said, we're constantly pushing the envelope on how complex the features you generate can be, what kinds of materializations you can make, what kinds of data sources we can handle. Everything we're doing is expanding the feature set, making it easier to deploy, making it easier to use, but the guiding star is making it so that teams as a whole can all work together on building machine learning, and specifically on the feature sets.
Tobias Macey
0:44:05
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Simba Khadder
0:44:20
Yeah, I think unifying streaming and batch. A lot of processing systems are trying to get there, like Flink has a batch API and Spark has streaming, but we're just not there yet; there's so much work to do in that space to really have a unified processing engine. I think that's a really big problem to solve, and doing it well, in a way that solves all the needs and is still usable, is the hard part. I think Flink and Spark are both kind of getting there, but I definitely wouldn't say it's as easy as just using one or the other.
Tobias Macey
0:44:58
All right, well, thank you very much for taking the time today to join me and discuss the work that you're doing on StreamSQL. It's definitely a very interesting problem space, and one that is becoming increasingly necessary as we move more and more of our application logic into machine learning. So thank you for all of the effort you've put in on that front, and I hope you enjoy the rest of your day.
Simba Khadder
0:45:19
Yes, thank you for having me.
Tobias Macey
0:45:26
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.