Summary
Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Razi Raziuddin about how data engineers can empower data scientists to develop and deploy better ML models through feature engineering
Interview
- Introduction
- How did you get involved in the area of data management?
- What is feature engineering, and why/to whom does it matter?
- A topic that commonly comes up in relation to feature engineering is the importance of a feature store. What are the tradeoffs for that to be a separate infrastructure/architecture component?
- What is the overall lifecycle of a feature, from definition to deployment and maintenance?
- How is this distinct from other forms of data pipeline development and delivery?
- Who are the participants in that workflow?
- What are the sharp edges/roadblocks that typically manifest in that lifecycle?
- What are the interfaces that are needed for data scientists/ML engineers to be able to self-serve their feature management?
- What is the role of the data engineer in supporting those interfaces?
- What are the communication/collaboration channels that are necessary to make the overall process a success?
- From an implementation/architecture perspective, what are the patterns that you have seen teams build around for feature development/serving?
- What are the most interesting, innovative, or unexpected ways that you have seen feature platforms used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature engineering?
- What are the resources that you find most helpful in understanding and designing feature platforms?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today I'm interviewing Razi Raziuddin about how data engineers can empower data scientists to develop and deploy better ML models through feature engineering. So, Razi, can you start by introducing yourself?
[00:00:54] Unknown:
Absolutely. Thanks, first of all, Tobias, for having me on the podcast. Very excited to be here. I'm the CEO and cofounder of FeatureByte. We're a Boston-based startup focused very much on scaling enterprise AI, primarily by radically simplifying feature engineering and management. I've been in the Boston area now for over two decades, in data and analytics, bringing innovative data and analytics products to market. I had the good fortune of being on the leadership team of two different unicorn startups. I was at DataRobot most recently, where I helped scale the company from around 10 employees when I first joined to 850 employees in under six years, which was just an amazing ride.

And prior to that, I was at another unicorn, before the term was used for describing interesting startups: a company called Netezza, which is also Boston based. It did an IPO and was then sold to IBM for close to $2 billion in 2010. So just being in the space has been fascinating. It's evolving every other day, there's always something new to learn. It's a very exciting space, so I look forward to the conversation.

And do you remember how you first got started working in data?

So I moved to the Boston area with a company called EMC, which is now part of Dell, and that was in the data space. I was fascinated with data, for one reason or another, right from grad school, so I wanted to be in the data space. And then I really got hooked on analytics and data when I was at Netezza.

I got a firsthand view of how data and analytics help drive businesses, and saw the power of analytics and how it can be used to truly create a moat around businesses. Our customers were just fascinated with the kinds of things they were able to do with the data. And then, moving to DataRobot, the power of ML was truly amazing. It was just amazing to see what data analysts and business analysts were able to do with AutoML: they were able to create super interesting models by just uploading their data and clicking a few buttons here and there.

And it's fascinating for me to be at this sort of convergence of business and technology, which is really where analytics lives. And then you look at the space, and there's something new happening every other day. It's just so exciting. There's so much innovation.
[00:03:51] Unknown:
So you feel you'll learn something new literally every single day. Absolutely. That's definitely how it's been with running this podcast and my other shows, and just working in the space for the past decade plus at this point. Focusing in on the topic at hand for today, we're talking about feature engineering. And before we get too far into that, why don't we just start by giving a recap of what even is a feature, and how is it distinct from a table in a warehouse, or an ML model, or any of the other data assets that we produce as data engineers?
[00:04:26] Unknown:
No. Absolutely. Yeah. That's a great question. I think the best place to start is just the fact that great AI needs great data to be successful: great AI starts with great data. And features are the data elements that get fed into algorithms to train machine learning models and to be able to do predictions off of those models. You can think of features as attributes of different entities. Good examples are demographic information associated with, for example, a customer or a product.

So age, sex, gender, etcetera, those are simple features that can just be derived from data that exists around a particular entity. But features can get super complex, and a lot of features actually get derived from raw data. Take the case of credit card fraud as an example: knowing that I'm sitting in Boston while my credit card is being swiped in, I don't know, Moscow or Timbuktu, that's a pretty good indication that there's some kind of fraudulent transaction happening. But that's not something that's intuitive to machines. The data that's available needs to be represented in a form that algorithms can easily learn from. So in this particular case, a data scientist would have to derive that feature by calculating the distance between where a credit card holder like myself is sitting and where the transaction has taken place. Now the distance, which is measured in, let's say, thousands of miles, helps the machine learning model understand something about that transaction, and the model is able to flag it as potentially fraudulent. Right?

This is again a pretty simple example. Features can get super complex, especially when you're trying to represent purchase behaviors and patterns. For example, are you a regular shopper or a binge shopper? What are the differences in shopping behaviors and patterns associated with a particular age or demographic? What kinds of things do you like to buy? How consistent are you? How diverse are the purchase patterns? Those are the kinds of things that constitute features. And the more complex they are, the more signal they carry within them to train models, as well as to derive some really good predictions out of those models.
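As an illustration of that distance feature, here is a minimal sketch in Python. The haversine formula, the coordinates, and the shape of the inputs are assumptions made for the example, not anything specified in the conversation.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Derived feature: distance between the cardholder's usual location and the
# point of sale. A large value is a strong signal of a fraudulent transaction.
distance = haversine_miles(42.36, -71.06,  # Boston
                           55.75, 37.62)   # Moscow
print(f"{distance:.0f} miles")  # roughly 4,500 miles
```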
[00:07:21] Unknown:
And so in the process of developing a feature, the term feature engineering comes up a lot. And I'm wondering if you can talk through what that is, why it matters, who cares about feature engineering, and in what ways it is distinct from some of the other types of data engineering pipelines that folks might be familiar with. No. Absolutely. And so the process of basically taking raw data, which exists in data warehouses and data lakes,
[00:07:52] Unknown:
into these features, that process is feature engineering. It's a critical part of the whole machine learning life cycle, where you're transforming your raw data into data that can be easily fed into machine learning algorithms for training and for predictions. As far as how feature engineering and feature pipelines are different from traditional data pipelines, that's a topic where it all depends on how much time you have and how much time you can spend talking about it, but happy to dive in. I think one of the most common misconceptions, especially in the data engineering world, is that a pipeline is a pipeline at the end of the day. Right? It's just taking data from one place to another and doing some transformations.

And when you think about features, they are transformations in some way. Conceptually, they're very similar to ETL or ELT transforms. The challenge is, and this is something that helps get across the difference between traditional BI and analytics pipelines and machine learning pipelines: the fundamental difference is that BI is for humans and human consumption, and AI is for machines. Right? That fundamental difference leads to all kinds of complexity in feature pipelines. Feature pipelines, when you dig underneath, are at least an order of magnitude more complex, if not more so, compared to your traditional BI and analytics pipelines.

If you think about the scale, each machine learning model can easily utilize hundreds of features, whereas dashboards may have 5 to 10 metrics max, again for human consumption. When you look at the data volumes, model training typically requires lots of detailed historical data, which means you're doing a lot of complex computations on literally large volumes of data, which makes data movement through these pipelines super expensive and inefficient. And the computational complexity is much higher for features than for calculating metrics.

Metrics are typically simple statistical functions that involve, let's say, 2, 3, maybe 10 columns. Whereas when it comes to features, you could have many different tables and many different columns being joined together, with super complex computations being done in a time-aware manner. And so the time awareness, the time travel, the need for consistency between training and serving, all of those things make ML pipelines and feature pipelines super complex and very difficult to build, manage, govern, deploy, and just keep healthy and operational.
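To make the time-awareness point concrete, here is a minimal sketch assuming a pandas DataFrame of raw transactions; the table and the 28-day window are invented for illustration. The key property is that each training row may only see data that existed at its own observation time, something a dashboard metric never has to worry about.

```python
import pandas as pd

# Hypothetical raw transactions table.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2024-01-02", "2024-01-20", "2024-02-15", "2024-01-10"]),
    "amount": [50.0, 20.0, 75.0, 10.0],
})

def trailing_spend(customer_id, as_of, days=28):
    """Total spend in the trailing window, using only data that existed at `as_of`."""
    window = transactions[
        (transactions["customer_id"] == customer_id)
        & (transactions["ts"] < as_of)                             # nothing from the future
        & (transactions["ts"] >= as_of - pd.Timedelta(days=days))  # trailing window only
    ]
    return float(window["amount"].sum())

# Feature value for customer 1, observed on 2024-02-01: only the Jan 20
# transaction falls inside the window, so the answer is 20.0, not 145.0.
print(trailing_spend(1, pd.Timestamp("2024-02-01")))
```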
[00:11:16] Unknown:
In terms of the feature itself, it seems like it's distinct from the output of an ETL pipeline, in that once the ETL pipeline has produced its output, that output is static until the next time the pipeline runs. Whereas the feature sounds like it's actually more the definition of a function that is used to compute the value at request time, versus just loading something that has already been computed, because of what you were saying about needing this historical look back at what the values are over time.
So a single feature is actually the function that's used to compute those values at whatever time span is being requested by the training operation.
[00:12:01] Unknown:
No. Absolutely. Yeah. I think that's again a super important point, Tobias, which is that you're constantly computing features as new data arrives. So there are jobs running in the background, computing features and making them available for predictions that typically require much lower latency than your BI report or dashboard. And that again increases the complexity around when data becomes available, how long it takes to actually compute the feature, and when the results of the feature are available for consumption by different applications out there. And another
[00:12:49] Unknown:
term that comes up often in conversations where features or feature engineering are present is the idea of a feature store or a feature platform. I'm wondering if you can talk through your views on the necessity of a feature store in relation to the computation and delivery of features, and what the trade-offs are for having that be its own dedicated architectural or infrastructure component?
[00:13:16] Unknown:
No. Absolutely. Yeah. From my perspective, feature stores are an absolutely critical component of the machine learning architecture. You need the ability to deploy pipelines consistently and quickly; that is something feature stores make available. The need to serve features at low latency is also a critical element of what feature stores deliver for machine learning pipelines and architectures. I think it's really interesting that you bring up the point of whether feature stores need to be a separate component or a separate infrastructure altogether, versus being integrated into the modern data stack.

Traditionally, given the complexity of the compute as well as the scale requirements, you pretty much had to go off and build a feature store as a completely separate, dedicated infrastructure to deal with the processing and serving requirements for features. But this ends up creating all kinds of challenges. You now have to build and manage a completely separate infrastructure, which means additional cost and separate teams dedicated to managing that infrastructure.

Data privacy and governance become a huge concern. Features carry lots and lots of signals associated with highly sensitive data, and not being able to manage that in the environment where the rest of the data is managed becomes hugely challenging. And then, if you have a separate platform, a separate environment for processing data, it leads to data inconsistency as well, because data that's landing in your data warehouse may be slightly different by the time it gets pulled into a completely different environment. So from our perspective, look at what the modern data stack offers and the power that's available in modern cloud platforms like the Snowflakes, Databricks, and BigQueries of the world.

There's just so much compute power available now, especially with the separation of compute and storage. Plus, the kinds of capabilities being built into these platforms make it very easy to run super complex computations on massive data sets. So it just makes sense now to build feature stores in the data platform itself: push the compute to where the data lives instead of having to move the data into a completely separate environment. It automatically leads to better governance and better utilization of infrastructure that already exists, mitigates some of the privacy concerns, and increases consistency of the data. There are obviously going to be certain situations where the current data platforms just don't have the ability to do low-latency serving, as an example, or to act as a really good cache, or to process embeddings from NLP and LLMs very efficiently. In those cases, you extend the modern data stack instead of building a completely separate infrastructure.

That's kind of our view, at least for what customers should be thinking of building.
[00:17:10] Unknown:
Digging more into the special case of feature engineering as compared to ETL workflows, I'm wondering if you can talk through the overall life cycle of a feature, going from the ideation of what it should be, through to the definition, deployment, and maintenance.
[00:17:30] Unknown:
Yeah. Absolutely. I'll talk about the life cycle of features and also touch on where things are kind of broken, from our perspective. The life cycle itself is: you create features, then you run experiments on the features, so you build models to understand which features are the most interesting given a particular use case and the type of model you're trying to create. Then you serve those features, for training as well as primarily for predictions, and then manage and govern the features and monitor the health of the pipeline. If you dig into each one of those: the process of just coming up with the right features requires a lot of domain knowledge and deep understanding of the data.

And it requires a lot of data science and coding skills to extract the right signals from the data itself. Right? So just to create the features, you need data scientists, domain experts, and SMEs working together. Then when it comes to experimentation, it's about creating a very accurate view of historical data to train models, the process that's known as backfilling, and then being able to train the models themselves to understand which features are important. Thankfully, when it comes to training models, there are all kinds of AutoML tools out there that make it very easy for even non data scientists or citizen data scientists to go off and build models very efficiently.
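For a concrete picture of backfilling, here is a minimal point-in-time join sketch in pandas; the tables and column names are invented for illustration. Each labeled training event is matched with the most recent feature value that existed strictly before it, which is exactly the historical accuracy backfilling has to preserve.

```python
import pandas as pd

# Hypothetical precomputed feature snapshots and labeled training events.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
    "trailing_spend": [120.0, 80.0, 45.0],
}).sort_values("feature_ts")

labels = pd.DataFrame({
    "customer_id": [1, 2],
    "event_ts": pd.to_datetime(["2024-02-10", "2024-02-01"]),
    "churned": [0, 1],
}).sort_values("event_ts")

# For each labeled event, pick the latest feature value strictly before it,
# per customer: point-in-time correctness with no leakage from the future.
training_set = pd.merge_asof(
    labels, features,
    left_on="event_ts", right_on="feature_ts",
    by="customer_id",
    allow_exact_matches=False,  # strictly before the event
)
print(training_set[["customer_id", "event_ts", "trailing_spend", "churned"]])
```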
One of the key areas of challenge comes up when you're trying to serve features, building data pipelines to serve features for inferencing or predictions. That's where you need a lot of help from data engineers to build machine learning pipelines. And that interaction between data scientists and data engineers is fraught with friction. It slows down the entire process. We've seen many situations where it literally takes weeks, if not months, to go from when the features are created and the experimentation done to actually deploying features in production.

And we see that as a major stumbling block in being able to truly scale enterprise AI. Then finally, you've got management and governance. This is becoming increasingly important, even for nonregulated industries, where it's important to know who touched what data and who has access to which data, ensuring the health of feature pipelines as the data is constantly changing, managing the cost associated with the number of features exploding in different environments, and ultimately governing which features actually get deployed, where they get used, and by whom.

All of that needs to be centrally managed and governed as far as possible. So that's the life cycle of features, and some of the challenges associated with it.
[00:21:10] Unknown:
In terms of the sharp edges or roadblocks that manifest in that overall life cycle, what are the biggest pain points that you see teams running up against? And what is the role of the data engineer in addressing those, maintaining those, and enabling other members of the team and other participants in that process to get their work done without having to ask the data engineer for permission, or to do a particular step every time they need to get something done? Yeah. I think you're
[00:21:45] Unknown:
pointing to an area that, as I was mentioning, is truly fraught with friction in some ways. One of the key challenges with this whole life cycle is that it requires three different skills. You need domain knowledge, data science expertise, and data engineering expertise to come together. You need domain knowledge to determine what signals to extract from the data itself, data science knowledge to extract those signals, and ultimately data engineering expertise to perform all of these operations at scale.

And the challenge is that these skills tend to live in different teams within a given organization. They use different tools. They speak different languages. They don't necessarily understand each other. So you can pretty much see where the points of friction are. Right? When you have three completely different teams trying to interact with each other, it makes that entire process super complex. And honing in on the interaction between data scientists and data engineers, that is a huge problem in and of itself.

Even though both of these personas work on data, the way they approach problem solving and the way they approach the data itself is vastly different. It's almost like the old Mars and Venus book: data engineers are from Mars and data scientists from Venus, or take your pick, the other way around. But the fact of the matter is that data scientists are experimenters. Right? They work on open-ended problems. They like to continually iterate on data and run a bunch of experiments, whereas data engineers are builders.

They work from well-defined specifications to build pipelines that are healthy, easily maintainable, and do what they're supposed to do. They're the opposite of experimenters. And so when these two personas come together, it obviously creates huge friction in getting things done. And when we look at this space, there's a dearth of tools to make data scientists a lot more self-reliant when it comes to data, and when it comes to using and preparing data for modeling.

The same goes for tools that help data engineers create these ML-ready pipelines, incorporating key requirements such as time awareness, low-latency serving, etcetera. I think that just exacerbates the challenge and the friction in the whole process.
[00:24:45] Unknown:
For data engineers who are working in a team where data scientists or ML engineers are going to be the primary definers and maintainers of features, what are the interfaces that they're going to be looking to use to more closely integrate that process into their workflow and their tool chain?
[00:25:04] Unknown:
Yeah. That makes sense. Looking at it from the perspective of the data scientists: ultimately, data scientists are the ones who come up with features. They are the ones using these features to build models, and they're ultimately responsible for the overall predictive power and health of the models, and by association, the health of the features that get built. So from my perspective, data scientists are the ones who should ultimately be responsible for the entire feature life cycle.

And data engineers or ML engineers should be there in a support function, to create the right environment and the right architecture that makes it easy for data scientists to consume data, build pipelines, deploy, and do the entire life cycle in a self-service manner. Right? So when it comes to the interfaces, if we double click on this and look at each of the four steps I described in the feature life cycle, it'll be interesting to dive in and look at what those interfaces would actually mean.

When it comes to feature creation, as I was mentioning, it's super important for data scientists to understand the domain and the context behind the data. So some way of making it easy for data scientists to interact with the domain experts and subject matter experts, and some way of creating a common language so that the interaction becomes much easier, is super important. To give you an example, when it comes to different signals and signal types associated with features, both data scientists and domain experts understand RFM, the recency, frequency, and monetary metrics. So extending RFM into different dimensions, such as diversity of purchases, similarity of behavior, etcetera, is one way to make the interaction between SMEs and data scientists a whole lot easier.
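For reference, here is a quick sketch of those RFM metrics computed per customer in pandas; the sample table and the as-of date are made up for illustration.

```python
import pandas as pd

# Hypothetical purchase history.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-28", "2024-01-02",
                          "2024-01-15", "2024-01-30"]),
    "amount": [40.0, 25.0, 10.0, 60.0, 15.0],
})
as_of = pd.Timestamp("2024-02-01")

rfm = transactions.groupby("customer_id").agg(
    recency_days=("ts", lambda s: (as_of - s.max()).days),  # days since last purchase
    frequency=("ts", "count"),                              # number of purchases
    monetary=("amount", "sum"),                             # total spend
)
print(rfm)
```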
The rise in popularity of semantic layers, for example, is primarily driven by the need to have data professionals, subject matter experts, and business professionals come together with a common way of interacting and speaking with each other. That, again, will help this entire process of working with the subject matter experts. The other challenge we see over and over again is having to write tons and tons of code to do the same sorts of things repeatedly. Instead of having to write tons of Python code to create features, it should be a simple declarative language that makes it very easy to take even some of the more state-of-the-art functions that, for example, Kaggle grandmasters use to win Kaggle competitions, and have all the back end SQL or Spark auto-generated, which will make the entire process of not just feature creation, but also the experimentation and pipelining, a whole lot easier. Right? So that's feature creation itself. If we get to experimentation, there's a huge reliance on data engineers to give data scientists different cuts of data, different slices of data on which they can do their experiments in a silo.
I think that process, again, is fraught with challenges and risk. First, there is an obvious reliance on data engineering teams just to pull data from a data warehouse or data lake and dump it into a different environment. But also, your experimentation is being done in a silo, in a different environment, as opposed to where the feature pipelines are ultimately going to get built. That leads to potential errors and inconsistency in the data. Right? So having the ability to actually run your experiments on live data is super helpful for data scientists. It shortens the time between coming up with ideas and actually running experiments, and lets you do that in the environment where the feature pipelines are ultimately going to live. And then we see a lot of reinvention of the wheel going on.

So if there's a really powerful feature catalog that makes it easy for data scientists to come in and reuse what's already out there, rather than having to recode everything that's already been done, that is a super efficient interface that makes experimentation easy. Moving on to serving: as we were discussing earlier, this is one of the most painful parts. It's a very error-prone and time-consuming cycle. For me, a lot of the challenges come from taking features that have been created in Python and translating them into time-aware SQL and Spark. Right? There's no reason why that should be a manual process, especially with LLMs and generative AI being able to understand language. We should be able to very easily take features that have been declared in Python, create the equivalent Spark or SQL that's appropriate for the right back end, deploy those pipelines, and have the jobs associated with those pipelines run in an automated fashion. Right?
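To give a flavor of the "declare in Python, generate time-aware SQL" idea, here is a hypothetical sketch. This is not FeatureByte's actual SDK or any real library's API, just an invented illustration of a declarative feature definition that a platform could translate and run where the data lives.

```python
from dataclasses import dataclass

@dataclass
class WindowAggregate:
    """A declared feature: '<agg> of <column> over the trailing <window_days> days'."""
    entity: str       # e.g. "customer_id"
    source: str       # source table in the warehouse
    column: str       # column to aggregate
    agg: str          # SQL aggregate function, e.g. "SUM"
    window_days: int  # trailing window length

    def to_sql(self, as_of: str) -> str:
        """Generate simplified time-aware SQL for one observation timestamp."""
        return (
            f"SELECT {self.entity}, {self.agg}({self.column}) AS value\n"
            f"FROM {self.source}\n"
            f"WHERE ts < TIMESTAMP '{as_of}'\n"
            f"  AND ts >= TIMESTAMP '{as_of}' - INTERVAL '{self.window_days}' DAY\n"
            f"GROUP BY {self.entity}"
        )

spend_28d = WindowAggregate("customer_id", "transactions", "amount", "SUM", 28)
print(spend_28d.to_sql("2024-02-01 00:00:00"))
```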
And ultimately, you need really good management, because these features carry sensitive information. They're super curated pieces of data that need to be managed very effectively and efficiently. Any good self-service environment needs the right oversight in order to function smoothly and run the right way. So giving the organization the ability to centrally manage, govern, and monitor the health of the pipelines and the cost associated with these different features, and the ability to manage the privacy and access controls around features and feature pipelines, becomes super important as well.

Creating that environment will truly unleash the ability for data scientists to do a whole lot more. It's one of those things that holds data scientists back when it comes to enterprise AI, and we feel like this is something that's going to 10x the ability for data scientists to go off and build more models, deploy them, manage and monitor them, and keep updating them as new data arrives and as changes happen in the market and the market conditions themselves. And in terms of the
[00:33:12] Unknown:
creation of these platforms to enable feature engineering, what are the foundational components that need to be in place before the engineering team can start to think about building out these higher levels? In particular, I'm interested in the data discovery aspect: for the ML engineers and data scientists who want to define features, being able to understand what pieces of information are available to them for doing that exploration and experimentation.
[00:33:41] Unknown:
Yeah. I think it's a really interesting question. Obviously, having good quality data available is a very important prerequisite for doing any kind of analytics, whether it's your traditional BI or doing ML on the data itself. Now, I'll also caveat that by saying that no one has very good data. Right? I don't know of any team out there that will claim that their data is perfect or close to perfect. There are always issues with the data. So this is always a journey to get to better and better, higher quality data over time.

But even with decent quality data, you can do a whole lot, and that's an important prerequisite. The same thing applies to the understanding of the data. If you look at data dictionaries and data catalogs, there's just tons of information that's always missing. And when it comes to data science, the more variety of data that exists, the better it is for modeling and for ML purposes. Again, it needs to be an iterative process where you start with the data that's fairly well understood, you build some models, and then you continue to run experiments with data that perhaps is not as well fleshed out or as well understood, and work with the data SMEs and business SMEs to gain a better understanding.

And it becomes a cycle where, over time, the need from the business as well as from the data science teams to have better quality, well-defined data ultimately drives the organization to go in that direction.
[00:35:54] Unknown:
Because feature engineering is such an all-encompassing process, requiring quality data, up-to-date information, and cooperation and collaboration across different stakeholders within the team or the organization, what are the communication channels or collaboration utilities that are necessary to, you can't say ensure success, but at least help promote success in this overall effort?
[00:36:25] Unknown:
Indeed. Yeah. Communication between these teams is definitely a huge challenge. If you look at what the current channels for communication are, it's all the usual ones, like email and Slack and spreadsheets, on which different pieces of information get exchanged and requirements get tracked and maintained. And while those are a good start, they're definitely not sufficient. They're good for informal communications, but when it comes to formalizing the semantics as well as the requirements behind certain elements of the data, the features, etcetera, you need something that's a lot more powerful. And so this is where a feature platform that allows those communications to happen much more efficiently becomes an integral part of the overall ML life cycle and pipeline.

This is not to say that everything should necessarily live inside a feature platform. There will always be all kinds of informal communication. But when it comes to a formal contract, let's say, between the subject matter expert or data producer and the consumer, in this case the data science teams, that needs to be much better formalized inside some kind of feature tool or platform.
[00:38:06] Unknown:
For teams who are just starting on the journey of building out their ML systems, or empowering their data scientists or ML engineers to do this feature development, what are some of the early mistakes that you see them often running into, where maybe they would be better served by just buying something off the shelf or using an open source tool? What are some of the not-invented-here syndrome cases that you see happen frequently?
[00:38:36] Unknown:
Yeah. That is super interesting. We've seen that actually in a lot of different places where, either due to necessity or due to the inventiveness or desire of data engineering teams, there have been interesting attempts at building out what I would call first-generation feature stores, or feature repositories. And in some ways, given the plethora of tools that exist, it's not that difficult to come up with infrastructure that allows you to build these pipelines and do feature engineering effectively.

The challenge becomes dealing with the overall complexity and the speed at which things keep changing in the overall environment, both with the data and with the underlying technologies. Right? Keeping up with both of those becomes a huge challenge for data engineering teams. So we've seen a lot of places where these repositories get built, data scientists come in and start using them, and soon it goes from being a nice, well-governed piece of infrastructure to becoming what I would call a feature lake instead of a feature store or feature repository, where every data scientist comes in and dumps whatever they've done, which creates more noise and compute complexity in the overall process.

And it's not that difficult to see when you talk to clients out there: feature stores with literally tens of thousands of features that have been created by teams of 20 to 25 data scientists. And you think, okay, how is that possible? Well, one of the key reasons is there's no governance. There's no cataloging of what's already been done. So you end up with an explosion of features in your feature platform that becomes almost impossible to maintain over time.
[00:41:03] Unknown:
So on that point of maintenance, who is primarily responsible for ensuring that you don't end up in that situation of just having a morass of features, where you have 5 different versions of a feature that almost do the same thing, but not quite, and no clear way to determine which one is actually being used? What are some of the ways that teams need to be thinking about how to maintain that overall? Who is responsible for that process? What are some of the overarching analytics that are necessary to understand when it's necessary and possible to reap old features and just delete them out of the repository, etcetera?
[00:41:45] Unknown:
Yeah. I think that's a great question. It goes back, Tobias, to some sort of standardization in the way features actually get created. Because if you think about it, when a data scientist comes in, they have to be able to find and trust the features that have already been created and deployed. If it takes longer for them to understand what's inside a feature than to just write their own code, which they trust, then obviously a good data scientist is going to lean towards the latter: okay, well, it takes me two days to figure out the code that's been written by my colleague, so I'd rather just write my own and deal with it. And so you need a declarative framework, a structured, low-code way of creating these features.

That makes it super easy, first of all, to promote reuse of what already exists. And then you need a super strong catalog. In many ways, we feel like the catalog should be something that's self-organizing. It should automatically organize features based on an understanding of the lineage associated with the data, the class of signal it's emitting, or perhaps the complexity of the computation. There are different ways of organizing and categorizing these features. Right? But doing that in a much more automated way is super important, because otherwise you're always relying on data scientists to categorize and label the features and data they produce, which is difficult to rely on. So that's step number one. And then, when it comes to overall governance, it's the responsibility of a senior data scientist, perhaps a chief data scientist or a VP or director of data science, to impose some kind of governance guardrails to ensure you don't continually end up with ungoverned features that may carry sensitive information being made available to the general
[00:44:14] Unknown:
public. Yeah. The security and PII aspect was another thing I was going to ask about, so you beat me to the punch there. And the other thing I'm curious about, to your point of standardization, is whether there has been any convergence within the overall ecosystem on a standard set of interfaces, both for exposing information to these different feature platforms and feature stores, as far as what data is available and how it gets consumed, and what the query or exploration interfaces are for processing that data through those features. And then on the consuming side, for building the ML models, has there been any standardization in terms of how the model training or model development tools and platforms interface with those features, to be able to query them as part of that training and model building process?
[00:45:09] Unknown:
Yeah. I think a lot of the interfaces, especially for modeling, model building, and predictions, are fairly standardized, although there's no formal standard. There are not too many variations in the way that modeling tools and platforms consume data, so that's fairly easy to standardize around. Creation of features is something that's been very ad hoc over time, despite the availability of interesting Python packages, R packages, etcetera, which focus a lot more on the experimentation side but not so much on actually deploying these feature pipelines.

And so now there's a resurgence of interest in creating declarative frameworks, especially in Python, that make it very easy for data scientists to go from declaring features in a few lines of code, you know, 6 to 8 lines of code, to creating truly state-of-the-art features and actually deploying those pipelines, being able to backfill, run experiments, etcetera. So yeah, it's a space that's rapidly emerging. In fact, at FeatureByte, we just released a Python SDK not too long ago, just a few weeks ago, that addresses exactly that pain point of being able to create features very efficiently and quickly, and then deploy them in production as well.
[00:46:56] Unknown:
From your experience of working in this space, building a tool chain around the development, serving, and maintenance of features, and working with teams who are going along that journey themselves, what are some of the most interesting or innovative or unexpected ways that you've seen those teams solving that problem, building their own feature platforms, or leveraging existing feature platforms to power their work?
[00:47:21] Unknown:
Yeah. It's interesting. I think most of what we've seen is what I was describing as first-generation feature repositories, more like feature lakes. And we've seen organizations with literally tens of thousands of features that just keep getting computed over and over again. One particular example that comes to mind is super interesting: there was a company where the feature store was actually built by a management consulting company, not an IT consulting company. And it's one of the top management consulting companies that came in and built the feature store.

This was many years ago, and the same features are still being utilized now. So again, it shows that it's hard to do, which is one of the reasons why data scientists tend not to spend too much time trying to get these features into deployment. But at the same time, it's just so critical for businesses out there to do analytics the right way. There has to be a different way.
[00:48:44] Unknown:
And another thing I was just realizing is that we've spent a lot of the conversation focused on the role of the data engineer in this context of feature engineering and collaboration with ML engineers and data scientists. But another thing that we touched on earlier was the disparity between business intelligence, and more point-in-time analytics, versus the continuous time analysis of machine learning models. I'm curious what you have seen as some of the ways that analytics and business intelligence teams collaborate with the ML teams to understand more about the problem domain, or ways that they can feed off of each other in the process of building ML models to support the business, based on some of the knowledge acquired through that work of building the more point-in-time analytical interfaces.
[00:49:39] Unknown:
Yeah. I think, Tobias, that there's still lots of commonality: for example, data quality, which affects ML pipelines a whole lot more than BI pipelines and BI metrics. Any changes or corrections made to the data being fed into feature pipelines automatically help BI pipelines as well. When it comes to semantics, the need for a deeper understanding of the data is much more critical for feature pipelines, but ultimately it helps BI pipelines and any kind of BI analytics as well. So anything that is ultimately important for ML is super helpful, I should say, for BI and traditional analytics too.

And when it comes down to it, there has to be a clear correlation between features and metrics. If the way metrics are computed is very different from how features get computed, that's going to lead to challenges down the road in terms of explainability of models and the trust from the business in the data that's generating certain predictions. So at the end of the day, the space has to converge when it comes to data management overall, but that's probably going to take a long while to happen.
[00:51:29] Unknown:
And in your experience of building FeatureByte and working in this space of feature engineering and ML model development, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:42] Unknown:
I think one of the key things, and this was true even when my cofounder and I were at DataRobot, and it's true even now, is just how model-centric the world of AI is. Right? If you look at the number of tools that are out there for building models, doing hyperparameter tuning and experimentation, MLOps tools, etcetera, the whole space has evolved so much to help data scientists build and deploy really good models. But when you look at the first half of the ML life cycle, which is everything to do with data and data management, there is a serious dearth of tools out there. You've got some data labeling tools, and obviously the cloud data platforms that provide a lot of computational horsepower, but nothing that makes the overall process really straightforward, simple, and scalable.

And it just feels so lopsided. When you look at the data side of the equation, it's literally what it was back in perhaps 2015. Over the last almost decade, not much has truly changed and evolved. So it's something that, in my opinion, is ripe for disruption, ripe for major innovation in the space.
[00:53:16] Unknown:
For people who are interested in digging more into the concepts around feature engineering, feature development, feature platforms, what are some of the resources that you have found most helpful in understanding and designing those capabilities?
[00:53:31] Unknown:
Yeah. So for me personally, it's looking at what some of the large companies that have open sourced their projects have been able to do. There are open source projects like Feast and others; Feathr is one example coming from LinkedIn. They've done a really good job of putting feature stores in the limelight, so to say, and expressing the criticality of and the need for feature stores in the overall infrastructure. Outside of that, I look for inspiration from Kaggle grandmasters, and we've got two of them on our team, to understand the overall power and criticality of feature engineering: how it's helped them win amazing competitions, what they're able to do with the data, and how they're able to derive some really powerful signals from it. Talking to customers and understanding their needs is also super important for us as we're building these capabilities out at FeatureByte itself. And then, believe it or not, just going through social media, especially on LinkedIn, and looking at data engineers voicing their frustrations about dealing with feature pipelines and ML pipelines, and the machinations they have to go through in order to do things well.

Those are truly interesting inputs for us, or for any team, to consider as they're building these capabilities out. One other thing I was realizing
[00:55:26] Unknown:
I didn't touch on, and is maybe a conversation better suited for my other show about machine learning, is: for this work of feature engineering, are there particular categories of model architectures or model types, say deep learning versus linear regression, etcetera, where feature engineering is more useful? In particular, I'm wondering if deep learning workflows really leverage feature engineering and feature capabilities, versus just feeding the model a whole bunch of data and hoping something useful comes out the other side.
[00:56:02] Unknown:
Yeah. That's been a topic of conversation for at least a decade, if not more, and I'll give you an interesting anecdote. But to answer the question: feature engineering is absolutely necessary for tabular data. Deep learning models don't do a really good job of building really good models with tabular data. There isn't a lot of it available in open source, or in the open domain, for deep learning models to learn from or chew on. Deep learning models just have a much larger appetite for data than what most enterprises tend to have.

So, at least for now and for the foreseeable future, you need feature engineering to take raw data and create a representation of the real world that algorithms can learn from and do predictions on. And to your point, when it comes to deep learning models, you just feed in as much data as the algorithm can chew and as much compute as is available to you, and interesting things tend to come out of that, as we've seen more recently. The anecdote I was going to share is that I had the really good fortune of having a quick chat with Yann LeCun at a breakfast in Boston, and I was asking him about feature engineering and what he thinks about the general space.

And his comment was really interesting. He said, well, I've spent the last two decades of my professional career trying to get rid of feature engineering, but when it comes to tabular data, there is really no option out there. So, I mean, that's how the world is going to be, at least for now. And so even for companies who are thinking, okay, do we need feature engineering, especially with the Gen AI and the LLMs out there, how should we be thinking about it? For any kind of predictive problem, especially predictive problems that involve tabular data, you ultimately need a combination of features and embeddings, potentially from unstructured data: textual data, NLP, information that may exist. And you need to combine both of those in order to build a really powerful machine learning model.
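As a sketch of that last point, here is one way engineered tabular features and text embeddings can be combined into a single model input; the embedding function is a stand-in for whatever real model would produce them, and all of the values are invented for illustration.

```python
import numpy as np

def embed_text(text, dim=4):
    """Stand-in for a real embedding model (e.g. a sentence transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

# Engineered tabular features, e.g. the distance and trailing-spend values
# from the earlier sketches.
tabular_features = np.array([4485.0, 20.0, 0.2])

# Embedding of an unstructured text field attached to the same event.
note_embedding = embed_text("disputed overseas charge")

# One combined input vector for the model.
model_input = np.concatenate([tabular_features, note_embedding])
print(model_input.shape)  # (7,)
```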
[00:58:47] Unknown:
Are there any other aspects of the overall space of feature engineering, feature development, feature platforms that we didn't discuss yet that you would like to cover before we close out the show?
[00:58:58] Unknown:
I think we've touched upon quite a bit. So Alright.
[00:59:02] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:18] Unknown:
Tobias, my response to that question is obviously going to be a little biased, but I'll try my best to put it in more general terms. Right? When you look at the modern data stack, at pretty much any image that represents all the different layers of the modern data stack, there's all kinds of interesting tooling available for traditional BI and analytics. You have ELT that makes ingestion super easy. You've got tools like dbt for transforming your data for analytics. You've got all kinds of interesting analytics and visualization tools. It all looks very neat and tidy and pretty well layered.

One of the things that's missing is ML tools and platforms. I rarely come across a picture or an image of the modern data stack that has the ML side of the equation well integrated into the overall map. And when you try to draw that out, the connection between the modern data stack, the cloud data platform, and your ML tooling is very squiggly and a mess, because it's all manually set up and managed. It's the wild west in some ways: every organization does things very differently. For enterprises that are serious about scaling, that part of the equation absolutely needs to change, and it has to change ASAP in many ways. Right? I'll go so far as to say that not having a self-service data platform for data scientists is one of the biggest hurdles in being able to scale enterprise AI, and it's a badly needed extension to the modern data stack.

So from my perspective, that's one of the biggest, or one of the widest, data management challenges that I see out there that just needs to be fixed. And, obviously, the modern data stack allows the ability to do that very effectively.
[01:01:33] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on the overall process and workflows and requirements for feature engineering and the capabilities that it enables. It's definitely a very interesting space. I appreciate all of the work that you and your team are putting into making that a bit more of a tractable problem. So thanks again for taking the time, and I hope you enjoy the rest of your day. Thank you very much for having me, Tobias. Really appreciate it.
[01:02:49] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to the Guest and Topic
Understanding Features in Machine Learning
Feature Engineering: Process and Importance
Feature Stores and Platforms
Lifecycle of a Feature
Challenges in Feature Engineering
Interfaces and Collaboration
Foundational Components for Feature Engineering
Common Mistakes and Governance
Collaboration Between BI and ML Teams
Lessons Learned and Future Directions
Closing Remarks