Summary
Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Razi Raziuddin about how data engineers can empower data scientists to develop and deploy better ML models through feature engineering
Interview
- Introduction
- How did you get involved in the area of data management?
- What is feature engineering, and why/to whom does it matter?
- A topic that commonly comes up in relation to feature engineering is the importance of a feature store. What are the tradeoffs for that to be a separate infrastructure/architecture component?
- What is the overall lifecycle of a feature, from definition to deployment and maintenance?
- How is this distinct from other forms of data pipeline development and delivery?
- Who are the participants in that workflow?
- What are the sharp edges/roadblocks that typically manifest in that lifecycle?
- What are the interfaces that are needed for data scientists/ML engineers to be able to self-serve their feature management?
- What is the role of the data engineer in supporting those interfaces?
- What are the communication/collaboration channels that are necessary to make the overall process a success?
- From an implementation/architecture perspective, what are the patterns that you have seen teams build around for feature development/serving?
- What are the most interesting, innovative, or unexpected ways that you have seen feature platforms used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature engineering?
- What are the resources that you find most helpful in understanding and designing feature platforms?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today I'm interviewing Razi Raziuddin about how data engineers can empower data scientists to develop and deploy better ML models through feature engineering. So, Razi, can you start by introducing yourself?
[00:00:54] Unknown:
Absolutely. Thanks, first of all, Tobias, for having me on the podcast. Very excited to be here. I'm the CEO and cofounder of FeatureByte. We're a Boston-based startup focused very much on scaling enterprise AI, primarily by radically simplifying feature engineering and management. I've been in the Boston area now for over two decades, in data and analytics, bringing innovative data and analytics products to market. I had the good fortune of being on the leadership team of two different unicorn startups. I was at DataRobot most recently, where I helped scale the company from around 10 employees when I first joined to 850 employees in under six years, which was just an amazing ride.

And prior to that, I was at another unicorn, before the term was used for describing interesting startups: a company called Netezza, which is also Boston based. It did an IPO and was then sold to IBM for close to $2 billion in 2010. So just being in the space has been fascinating. It's evolving every other day, there's always something new to learn. It's a very exciting space, so I look forward to the conversation.

And do you remember how you first got started working in data?

So I moved to the Boston area with a company called EMC, which is now part of Dell, and that was in the data space. I was fascinated with data, for one reason or another, right from grad school, so I wanted to be in the data space. And then I really got hooked on analytics and data when I was at Netezza.

I got a firsthand view of how data and analytics help drive businesses, and saw the power of analytics and how it can be used to truly create a moat around businesses. Our customers were just fascinated with the kinds of things they were able to do with the data. And then, moving to DataRobot, the power of ML was truly amazing. It was just amazing to see what data analysts and business analysts were able to do with AutoML: they were able to create super interesting models by just uploading their data and clicking a few buttons here and there.

And it's fascinating for me to be at this sort of convergence of business and technology, which is really where analytics lives. And then you look at the space, and there's something new happening every other day. It's just so exciting. There's so much innovation.
[00:03:51] Unknown:
So you feel you'll learn something new literally every single day. Absolutely. That's definitely how it's been with running this podcast and my other shows, and just working in the space for the past decade plus at this point. Focusing in on the topic at hand for today, we're talking about feature engineering. And before we get too far into that, why don't we just start by giving a recap of what even is a feature, and how is it distinct from a table in a warehouse, or an ML model, or any of the other data assets that we produce as data engineers?
[00:04:26] Unknown:
No. Absolutely. Yeah. That's a great question. I think the best place to start is just the fact that great AI needs great data to be successful: great AI starts with great data. And features are the data elements that get fed into algorithms to train machine learning models and to be able to do predictions off of those models. You can think of features as attributes of different entities. Good examples are demographic information associated with, for example, a customer or a product.

So age, sex, gender, etcetera, those are simple features that can just be derived from data that exists around a particular entity. But features can get super complex, and a lot of features actually get derived from raw data. Take the case of credit card fraud as an example: knowing that I'm sitting in Boston while my credit card is being swiped in, I don't know, Moscow or Timbuktu, that's a pretty good indication that there's some kind of fraudulent transaction happening. But that's not something that's intuitive to machines. The data that's available needs to be represented in a form that algorithms can easily learn from. So in this particular case, a data scientist would have to derive that feature by calculating the distance between where a credit card holder like myself is sitting and where the transaction has taken place. Now the distance, which is measured in, let's say, thousands of miles, helps the machine learning model understand something about that transaction, and the model is able to flag it as potentially fraudulent. Right?

This is again a pretty simple example. Features can get super complex, especially when you're trying to represent purchase behaviors and patterns. For example, are you a regular shopper or a binge shopper? What are the differences in shopping behaviors and patterns associated with a particular age or demographic? What kinds of things do you like to buy? How consistent are you? How diverse are the purchase patterns? Those are the kinds of things that constitute features. And the more complex they are, the more signal they carry within them to train models, as well as to derive some really good predictions out of those models.
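As an illustration of that distance feature, here is a minimal sketch in Python. The haversine formula, the coordinates, and the shape of the inputs are assumptions made for the example, not anything specified in the conversation.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Derived feature: distance between the cardholder's usual location and the
# point of sale. A large value is a strong signal of a fraudulent transaction.
distance = haversine_miles(42.36, -71.06,  # Boston
                           55.75, 37.62)   # Moscow
print(f"{distance:.0f} miles")  # roughly 4,500 miles
```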
[00:07:21] Unknown:
And so in the process of developing a feature, the term feature engineering comes up a lot. And I'm wondering if you can talk through what that is, why it matters, who cares about feature engineering, and in what ways it is distinct from some of the other types of data engineering pipelines that folks might be familiar with. No. Absolutely. And so the process of basically taking raw data, which exists in data warehouses and data lakes,
[00:07:52] Unknown:
into these features, that process is feature engineering. It's a critical part of the whole machine learning life cycle, where you're transforming your raw data into data that can be easily fed into machine learning algorithms for training and for predictions. As far as how feature engineering and feature pipelines are different from traditional data pipelines, that's a topic where it all depends on how much time you have and how much time you can spend talking about it, but happy to dive in. I think one of the most common misconceptions, especially in the data engineering world, is that a pipeline is a pipeline at the end of the day. Right? It's just taking data from one place to another and doing some transformations.

And when you think about features, they are transformations in some way. Conceptually, they're very similar to ETL or ELT transforms. The challenge is, and this is something that helps get across the difference between traditional BI and analytics pipelines and machine learning pipelines: the fundamental difference is that BI is for humans and human consumption, and AI is for machines. Right? That fundamental difference leads to all kinds of complexity in feature pipelines. Feature pipelines, when you dig underneath, are at least an order of magnitude more complex, if not more so, compared to your traditional BI and analytics pipelines.

If you think about the scale, each machine learning model can easily utilize hundreds of features, whereas dashboards may have 5 to 10 metrics max, again for human consumption. When you look at the data volumes, model training typically requires lots of detailed historical data, which means you're doing a lot of complex computations on literally large volumes of data, which makes data movement through these pipelines super expensive and inefficient. And the computational complexity is much higher for features than for calculating metrics.

Metrics are typically simple statistical functions that involve, let's say, 2, 3, maybe 10 columns. Whereas when it comes to features, you could have many different tables and many different columns being joined together, with super complex computations being done in a time-aware manner. And so the time awareness, the time travel, the need for consistency between training and serving, all of those things make ML pipelines and feature pipelines super complex and very difficult to build, manage, govern, deploy, and just keep healthy and operational.
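To make the time-awareness point concrete, here is a minimal sketch assuming a pandas DataFrame of raw transactions; the table and the 28-day window are invented for illustration. The key property is that each training row may only see data that existed at its own observation time, something a dashboard metric never has to worry about.

```python
import pandas as pd

# Hypothetical raw transactions table.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2024-01-02", "2024-01-20", "2024-02-15", "2024-01-10"]),
    "amount": [50.0, 20.0, 75.0, 10.0],
})

def trailing_spend(customer_id, as_of, days=28):
    """Total spend in the trailing window, using only data that existed at `as_of`."""
    window = transactions[
        (transactions["customer_id"] == customer_id)
        & (transactions["ts"] < as_of)                             # nothing from the future
        & (transactions["ts"] >= as_of - pd.Timedelta(days=days))  # trailing window only
    ]
    return float(window["amount"].sum())

# Feature value for customer 1, observed on 2024-02-01: only the Jan 20
# transaction falls inside the window, so the answer is 20.0, not 145.0.
print(trailing_spend(1, pd.Timestamp("2024-02-01")))
```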
[00:11:16] Unknown:
In terms of the feature itself, it seems like it's distinct from the output of an ETL pipeline, in that once the ETL pipeline has produced its output, that output is static until the next time the pipeline runs. Whereas the feature sounds like it's actually more the definition of a function that is used to compute the value at request time, versus just loading something that has already been computed, because of what you were saying about needing this historical look back at what the values are over time.
So a single feature is actually the function that's used to compute those values at whatever time span is being requested by the training operation.
[00:12:01] Unknown:
No. Absolutely. Yeah. I think that's again a super important point, Tobias, which is that you're constantly computing features as new data arrives. So there are jobs running in the background, computing features and making them available for predictions that typically require much lower latency than your BI report or dashboard. And that again increases the complexity around when data becomes available, how long it takes to actually compute the feature, and when the results of the feature are available for consumption by different applications out there. And another
[00:12:49] Unknown:
term that comes up often in conversations where features or feature engineering are present is the idea of a feature store or a feature platform. I'm wondering if you can talk through your views on the necessity of a feature store in relation to the computation and delivery of features, and what the trade-offs are for having that be its own dedicated architectural or infrastructure component?
[00:13:16] Unknown:
No. Absolutely. Yeah. From my perspective, feature stores are an absolutely critical component of the machine learning architecture. You need the ability to deploy pipelines consistently and quickly; that is something feature stores make available. The need to serve features at low latency is also a critical element of what feature stores deliver for machine learning pipelines and architectures. I think it's really interesting that you bring up the point of whether feature stores need to be a separate component or a separate infrastructure altogether, versus being integrated into the modern data stack.

Traditionally, given the complexity of the compute as well as the scale requirements, you pretty much had to go off and build a feature store as a completely separate, dedicated infrastructure to deal with the processing and serving requirements for features. But this ends up creating all kinds of challenges. You now have to build and manage a completely separate infrastructure, which means additional cost and separate teams dedicated to managing that infrastructure.

Data privacy and governance become a huge concern. Features carry lots and lots of signals associated with highly sensitive data, and not being able to manage that in the environment where the rest of the data is managed becomes hugely challenging. And then, if you have a separate platform, a separate environment for processing data, it leads to data inconsistency as well, because data that's landing in your data warehouse may be slightly different by the time it gets pulled into a completely different environment. So from our perspective, look at what the modern data stack offers and the power that's available in modern cloud platforms like the Snowflakes, Databricks, and BigQueries of the world.

There's just so much compute power available now, especially with the separation of compute and storage. Plus, the kinds of capabilities being built into these platforms make it very easy to run super complex computations on massive data sets. So it just makes sense now to build feature stores in the data platform itself: push the compute to where the data lives instead of having to move the data into a completely separate environment. It automatically leads to better governance and better utilization of infrastructure that already exists, mitigates some of the privacy concerns, and increases consistency of the data. There are obviously going to be certain situations where the current data platforms just don't have the ability to do low-latency serving, as an example, or to act as a really good cache, or to process embeddings from NLP and LLMs very efficiently. In those cases, you extend the modern data stack instead of building a completely separate infrastructure.

That's kind of our view, at least for what customers should be thinking of building.
[00:17:10] Unknown:
Digging more into the special case of feature engineering as compared to ETL workflows, I'm wondering if you can talk through the overall life cycle of a feature, going from the ideation of what it should be, through to the definition, deployment, and maintenance.
[00:17:30] Unknown:
Yeah. Absolutely. I'll talk about the life cycle of features and also touch on where things are kind of broken, from our perspective. The life cycle itself is: you create features, then you run experiments on the features, so you build models to understand which features are the most interesting given a particular use case and the type of model you're trying to create. Then you serve those features, for training as well as primarily for predictions, and then manage and govern the features and monitor the health of the pipeline. If you dig into each one of those: the process of just coming up with the right features requires a lot of domain knowledge and deep understanding of the data.

And it requires a lot of data science and coding skills to extract the right signals from the data itself. Right? So just to create the features, you need data scientists, domain experts, and SMEs working together. Then when it comes to experimentation, it's about creating a very accurate view of historical data to train models, the process that's known as backfilling, and then being able to train the models themselves to understand which features are important. Thankfully, when it comes to training models, there are all kinds of AutoML tools out there that make it very easy for even non data scientists or citizen data scientists to go off and build models very efficiently.
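For a concrete picture of backfilling, here is a minimal point-in-time join sketch in pandas; the tables and column names are invented for illustration. Each labeled training event is matched with the most recent feature value that existed strictly before it, which is exactly the historical accuracy backfilling has to preserve.

```python
import pandas as pd

# Hypothetical precomputed feature snapshots and labeled training events.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
    "trailing_spend": [120.0, 80.0, 45.0],
}).sort_values("feature_ts")

labels = pd.DataFrame({
    "customer_id": [1, 2],
    "event_ts": pd.to_datetime(["2024-02-10", "2024-02-01"]),
    "churned": [0, 1],
}).sort_values("event_ts")

# For each labeled event, pick the latest feature value strictly before it,
# per customer: point-in-time correctness with no leakage from the future.
training_set = pd.merge_asof(
    labels, features,
    left_on="event_ts", right_on="feature_ts",
    by="customer_id",
    allow_exact_matches=False,  # strictly before the event
)
print(training_set[["customer_id", "event_ts", "trailing_spend", "churned"]])
```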
One of the key areas of challenge comes up when you're trying to serve features, building data pipelines to serve features for inferencing or predictions. That's where you need a lot of help from data engineers to build machine learning pipelines. And that interaction between data scientists and data engineers is fraught with friction. It slows down the entire process. We've seen many situations where it literally takes weeks, if not months, to go from when the features are created and the experimentation done to actually deploying features in production.

And we see that as a major stumbling block in being able to truly scale enterprise AI. Then finally, you've got management and governance. This is becoming increasingly important, even for nonregulated industries, where it's important to know who touched what data and who has access to which data, ensuring the health of feature pipelines as the data is constantly changing, managing the cost associated with the number of features exploding in different environments, and ultimately governing which features actually get deployed, where they get used, and by whom.

All of that needs to be centrally managed and governed as far as possible. So that's the life cycle of features, and some of the challenges associated with it.
[00:21:10] Unknown:
In terms of the sharp edges or roadblocks that manifest in that overall life cycle, what are the biggest pain points that you see teams running up against? And what is the role of the data engineer in addressing those, maintaining those, and enabling other members of the team and other participants in that process to get their work done without having to ask the data engineer for permission, or to do a particular step every time they need to get something done? Yeah. I think you're
[00:21:45] Unknown:
pointing to an area that, as I was mentioning, is truly fraught with friction in some ways. One of the key challenges with this whole life cycle is that it requires three different skills. You need domain knowledge, data science expertise, and data engineering expertise to come together. You need domain knowledge to determine what signals to extract from the data itself, data science knowledge to extract those signals, and ultimately data engineering expertise to perform all of these operations at scale.

And the challenge is that these skills tend to live in different teams within a given organization. They use different tools. They speak different languages. They don't necessarily understand each other. So you can pretty much see where the points of friction are. Right? When you have three completely different teams trying to interact with each other, it makes that entire process super complex. And honing in on the interaction between data scientists and data engineers, that is a huge problem in and of itself.

Even though both of these personas work on data, the way they approach problem solving and the way they approach the data itself is vastly different. It's almost like the old Mars and Venus book: data engineers are from Mars and data scientists from Venus, or take your pick, the other way around. But the fact of the matter is that data scientists are experimenters. Right? They work on open-ended problems. They like to continually iterate on data and run a bunch of experiments, whereas data engineers are builders.

They work from well-defined specifications to build pipelines that are healthy, easily maintainable, and do what they're supposed to do. They're the opposite of experimenters. And so when these two personas come together, it obviously creates huge friction in getting things done. And when we look at this space, there's a dearth of tools to make data scientists a lot more self-reliant when it comes to data, and when it comes to using and preparing data for modeling.

The same goes for tools that help data engineers create these ML-ready pipelines, incorporating key requirements such as time awareness, low-latency serving, etcetera. I think that just exacerbates the challenge and the friction in the whole process.
[00:24:45] Unknown:
For data engineers who are working in a team where data scientists or ML engineers are going to be the primary definers and maintainers of features, what are the interfaces that they're going to be looking to use to more closely integrate that process into their workflow and their tool chain?
[00:25:04] Unknown:
Yeah. That makes sense. Looking at it from the perspective of the data scientists: ultimately, data scientists are the ones who come up with features. They are the ones using these features to build models, and they're ultimately responsible for the overall predictive power and health of the models, and by association, the health of the features that get built. So from my perspective, data scientists are the ones who should ultimately be responsible for the entire feature life cycle.

And data engineers or ML engineers should be there in a support function, to create the right environment and the right architecture that makes it easy for data scientists to consume data, build pipelines, deploy, and do the entire life cycle in a self-service manner. Right? So when it comes to the interfaces, if we double click on this and look at each of the four steps I described in the feature life cycle, it'll be interesting to dive in and look at what those interfaces would actually mean.

When it comes to feature creation, as I was mentioning, it's super important for data scientists to understand the domain and the context behind the data. So some way of making it easy for data scientists to interact with the domain experts and subject matter experts, and some way of creating a common language so that the interaction becomes much easier, is super important. To give you an example, when it comes to different signals and signal types associated with features, both data scientists and domain experts understand RFM, the recency, frequency, and monetary metrics. So extending RFM into different dimensions, such as diversity of purchases, similarity of behavior, etcetera, is one way to make the interaction between SMEs and data scientists a whole lot easier.
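For reference, here is a quick sketch of those RFM metrics computed per customer in pandas; the sample table and the as-of date are made up for illustration.

```python
import pandas as pd

# Hypothetical purchase history.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-28", "2024-01-02",
                          "2024-01-15", "2024-01-30"]),
    "amount": [40.0, 25.0, 10.0, 60.0, 15.0],
})
as_of = pd.Timestamp("2024-02-01")

rfm = transactions.groupby("customer_id").agg(
    recency_days=("ts", lambda s: (as_of - s.max()).days),  # days since last purchase
    frequency=("ts", "count"),                              # number of purchases
    monetary=("amount", "sum"),                             # total spend
)
print(rfm)
```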
The rise in popularity of semantic layers, for example, is primarily driven by the need to have data professionals, subject matter experts, and business professionals come together with a common way of interacting and speaking with each other. That, again, will help this entire process of working with the subject matter experts. The other challenge we see over and over again is having to write tons and tons of code to do the same sorts of things repeatedly. Instead of having to write tons of Python code to create features, it should be a simple declarative language that makes it very easy to take even some of the more state-of-the-art functions that, for example, Kaggle grandmasters use to win Kaggle competitions, and have all the back end SQL or Spark auto-generated, which will make the entire process of not just feature creation, but also the experimentation and pipelining, a whole lot easier. Right? So that's feature creation itself. If we get to experimentation, there's a huge reliance on data engineers to give data scientists different cuts of data, different slices of data on which they can do their experiments in a silo.
I think that process, again, is fraught with challenges and risk. First, there is an obvious reliance on data engineering teams just to pull data from a data warehouse or data lake and dump it into a different environment. But also, your experimentation is being done in a silo, in a different environment, as opposed to where the feature pipelines are ultimately going to get built. That leads to potential errors and inconsistency in the data. Right? So having the ability to actually run your experiments on live data is super helpful for data scientists. It shortens the time between coming up with ideas and actually running experiments, and lets you do that in the environment where the feature pipelines are ultimately going to live. And then we see a lot of reinvention of the wheel going on.

So if there's a really powerful feature catalog that makes it easy for data scientists to come in and reuse what's already out there, rather than having to recode everything that's already been done, that is a super efficient interface that makes experimentation easy. Moving on to serving: as we were discussing earlier, this is one of the most painful parts. It's a very error-prone and time-consuming cycle. For me, a lot of the challenges come from taking features that have been created in Python and translating them into time-aware SQL and Spark. Right? There's no reason why that should be a manual process, especially with LLMs and generative AI being able to understand language. We should be able to very easily take features that have been declared in Python, create the equivalent Spark or SQL that's appropriate for the right back end, deploy those pipelines, and have the jobs associated with those pipelines run in an automated fashion. Right?
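To give a flavor of the "declare in Python, generate time-aware SQL" idea, here is a hypothetical sketch. This is not FeatureByte's actual SDK or any real library's API, just an invented illustration of a declarative feature definition that a platform could translate and run where the data lives.

```python
from dataclasses import dataclass

@dataclass
class WindowAggregate:
    """A declared feature: '<agg> of <column> over the trailing <window_days> days'."""
    entity: str       # e.g. "customer_id"
    source: str       # source table in the warehouse
    column: str       # column to aggregate
    agg: str          # SQL aggregate function, e.g. "SUM"
    window_days: int  # trailing window length

    def to_sql(self, as_of: str) -> str:
        """Generate simplified time-aware SQL for one observation timestamp."""
        return (
            f"SELECT {self.entity}, {self.agg}({self.column}) AS value\n"
            f"FROM {self.source}\n"
            f"WHERE ts < TIMESTAMP '{as_of}'\n"
            f"  AND ts >= TIMESTAMP '{as_of}' - INTERVAL '{self.window_days}' DAY\n"
            f"GROUP BY {self.entity}"
        )

spend_28d = WindowAggregate("customer_id", "transactions", "amount", "SUM", 28)
print(spend_28d.to_sql("2024-02-01 00:00:00"))
```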
And ultimately, you need really good management, because these features carry sensitive information. They're super curated pieces of data that need to be managed very effectively and efficiently. Any good self-service environment needs the right oversight in order to function smoothly and run the right way. So giving the organization the ability to centrally manage, govern, and monitor the health of the pipelines and the cost associated with these different features, and the ability to manage the privacy and access controls around features and feature pipelines, becomes super important as well.

Creating that environment will truly unleash the ability for data scientists to do a whole lot more. It's one of those things that holds data scientists back when it comes to enterprise AI, and we feel like this is something that's going to 10x the ability for data scientists to go off and build more models, deploy them, manage and monitor them, and keep updating them as new data arrives and as changes happen in the market and the market conditions themselves. And in terms of the
[00:33:12] Unknown:
creation of these platforms to enable feature engineering, what are the foundational components that need to be in place before the engineering team can start to think about building out these higher levels? In particular, I'm interested in the data discovery aspect: for the ML engineers and data scientists who want to define features, being able to understand what pieces of information are available to them for doing that exploration and experimentation.
[00:33:41] Unknown:
Yeah. I think it's a really interesting question. Obviously, having good quality data available is a very important prerequisite for doing any kind of analytics, whether it's your traditional BI or doing ML on the data itself. Now, I'll also caveat that by saying that no one has very good data. Right? I don't know of any team out there that will claim that their data is perfect or close to perfect. There are always issues with the data. So this is always a journey to get to better and better, higher quality data over time.

But even with decent quality data, you can do a whole lot, and that's an important prerequisite. The same thing applies to the understanding of the data. If you look at data dictionaries and data catalogs, there's just tons of information that's always missing. And when it comes to data science, the more variety of data that exists, the better it is for modeling and for ML purposes. Again, it needs to be an iterative process where you start with the data that's fairly well understood, you build some models, and then you continue to run experiments with data that perhaps is not as well fleshed out or as well understood, and work with the data SMEs and business SMEs to gain a better understanding.

And it becomes a cycle where, over time, the need from the business as well as from the data science teams to have better quality, well-defined data ultimately drives the organization to go in that direction.
[00:35:54] Unknown:
Because feature engineering is such an all-encompassing process, requiring quality data, up-to-date information, and cooperation and collaboration across different stakeholders within the team or the organization, what are the communication channels or collaboration utilities that are necessary to, you can't say ensure success, but at least help promote success in this overall effort?
[00:36:25] Unknown:
Indeed. Yeah. Communication between these teams is definitely a huge challenge. If you look at what the current channels for communication are, it's all the usual ones, like email and Slack and spreadsheets, on which different pieces of information get exchanged and requirements get tracked and maintained. And while those are a good start, they're definitely not sufficient. They're good for informal communications, but when it comes to formalizing the semantics as well as the requirements behind certain elements of the data, the features, etcetera, you need something that's a lot more powerful. And so this is where a feature platform that allows those communications to happen much more efficiently becomes an integral part of the overall ML life cycle and pipeline.

This is not to say that everything should necessarily live inside a feature platform. There will always be all kinds of informal communication. But when it comes to a formal contract, let's say, between the subject matter expert or data producer and the consumer, in this case the data science teams, that needs to be much better formalized inside some kind of feature tool or platform.
[00:38:06] Unknown:
For teams who are just starting on the journey of building out their ML systems, or empowering their data scientists or ML engineers to do this feature development, what are some of the early mistakes that you see them often running into, where maybe they would be better served by just buying something off the shelf or using an open source tool? What are some of the not-invented-here syndrome cases that you see happen frequently?
[00:38:36] Unknown:
Yeah. That is super interesting. We've seen that actually in a lot of different places where, either due to necessity or due to the inventiveness or desire of data engineering teams, there have been interesting attempts at building out what I would call first-generation feature stores, or feature repositories. And in some ways, given the plethora of tools that exist, it's not that difficult to come up with infrastructure that allows you to build these pipelines and do feature engineering effectively.

The challenge becomes dealing with the overall complexity and the speed at which things keep changing in the overall environment, both with the data and with the underlying technologies. Right? Keeping up with both of those becomes a huge challenge for data engineering teams. So we've seen a lot of places where these repositories get built, data scientists come in and start using them, and soon it goes from being a nice, well-governed piece of infrastructure to becoming what I would call a feature lake instead of a feature store or feature repository, where every data scientist comes in and dumps whatever they've done, which creates more noise and compute complexity in the overall process.

And it's not that difficult to see when you talk to clients out there: feature stores with literally tens of thousands of features that have been created by teams of 20 to 25 data scientists. And you think, okay, how is that possible? Well, one of the key reasons is there's no governance. There's no cataloging of what's already been done. So you end up with an explosion of features in your feature platform that becomes almost impossible to maintain over time.
[00:41:03] Unknown:
So on that point of maintenance, who is primarily responsible for ensuring that you don't end up in that situation of just having a morass of features, where you have 5 different versions of a feature that almost do the same thing, but not quite, and no clear way to determine which one is actually being used? What are some of the ways that teams need to be thinking about how to maintain that overall? Who is responsible for that process? What are some of the overarching analytics that are necessary to understand when it's necessary and possible to reap old features and just delete them out of the repository, etcetera?
[00:41:45] Unknown:
Yeah. I think that's a great question. It goes back, Tobias, to some sort of standardization in the way features actually get created. Because if you think about it, when a data scientist comes in, they have to be able to find and trust the features that have already been created and deployed. If it takes longer for them to understand what's inside a feature than to just write their own code, which they trust, then obviously a good data scientist is going to lean towards the latter: okay, well, it takes me two days to figure out the code that's been written by my colleague, so I'd rather just write my own and deal with it. And so you need a declarative framework, a structured, low-code way of creating these features.

That makes it super easy, first of all, to promote reuse of what already exists. And then you need a super strong catalog. In many ways, we feel like the catalog should be something that's self-organizing. It should automatically organize features based on an understanding of the lineage associated with the data, the class of signal it's emitting, or perhaps the complexity of the computation. There are different ways of organizing and categorizing these features. Right? But doing that in a much more automated way is super important, because otherwise you're always relying on data scientists to categorize and label the features and data they produce, which is difficult to rely on. So that's step number one. And then, when it comes to overall governance, it's the responsibility of a senior data scientist, perhaps a chief data scientist or a VP or director of data science, to impose some kind of governance guardrails to ensure you don't continually end up with ungoverned features that may carry sensitive information being made available to the general
[00:44:14] Unknown:
public. Yeah. The security and PII aspect was another thing I was going to ask about, so you beat me to the punch there. And the other thing I'm curious about, to your point of standardization, is whether there has been any convergence within the overall ecosystem on a standard set of interfaces, both for exposing information to these different feature platforms and feature stores, as far as what data is available and how it gets consumed, and what the query or exploration interfaces are for processing that data through those features. And then on the consuming side, for building the ML models, has there been any standardization in terms of how the model training or model development tools and platforms interface with those features, to be able to query them as part of that training and model building process?
[00:45:09] Unknown:
Yeah. I think a lot of the interfaces, especially for modeling, model building, and predictions, are fairly standardized, although there's no formal standard. There are not too many variations in the way that modeling tools and platforms consume data, so that's fairly easy to standardize around. Creation of features is something that's been very ad hoc over time, despite the availability of interesting Python packages, R packages, etcetera, which focus a lot more on the experimentation side but not so much on actually deploying these feature pipelines.

And so now there's a resurgence of interest in creating declarative frameworks, especially in Python, that make it very easy for data scientists to go from declaring features in a few lines of code, you know, 6 to 8 lines of code, to creating truly state-of-the-art features and actually deploying those pipelines, being able to backfill, run experiments, etcetera. So yeah, it's a space that's rapidly emerging. In fact, at FeatureByte, we just released a Python SDK not too long ago, just a few weeks ago, that addresses exactly that pain point of being able to create features very efficiently and quickly, and then deploy them in production as well.
[00:46:56] Unknown:
From your experience of working in this space, building a tool chain around the development, serving, and maintenance of features, and working with teams who are going along that journey themselves, what are some of the most interesting or innovative or unexpected ways that you've seen those teams solving that problem, building their own feature platforms, or leveraging existing feature platforms to power their work?
[00:47:21] Unknown:
Yeah. It's interesting. I think most of what we've seen is what I was describing as first-generation feature repositories, more like feature lakes. And we've seen organizations with literally tens of thousands of features that just keep getting computed over and over again. One particular example that comes to mind is super interesting: there was a company where the feature store was actually built by a management consulting company, not an IT consulting company. And it's one of the top management consulting companies that came in and built the feature store.

This was many years ago, and the same features are still being utilized now. So again, it shows that it's hard to do, which is one of the reasons why data scientists tend not to spend too much time trying to get these features into deployment. But at the same time, it's just so critical for businesses out there to do analytics the right way. There has to be a different way.
[00:48:44] Unknown:
And another thing I was just realizing is that we've spent a lot of the conversation focused on the role of the data engineer in this context of feature engineering and collaboration with ML engineers and data scientists. But another thing that we touched on earlier was the disparity between business intelligence, and more point-in-time analytics, versus the continuous time analysis of machine learning models. I'm curious what you have seen as some of the ways that analytics and business intelligence teams collaborate with the ML teams to understand more about the problem domain, or ways that they can feed off of each other in the process of building ML models to support the business, based on some of the knowledge acquired through that work of building the more point-in-time analytical interfaces.
[00:49:39] Unknown:
Yeah. I think, Tobias, that there's still lots of commonality: for example, data quality, which affects ML pipelines a whole lot more than BI pipelines and BI metrics. Any changes or corrections made to the data being fed into feature pipelines automatically help BI pipelines as well. When it comes to semantics, the need for a deeper understanding of the data is much more critical for feature pipelines, but ultimately it helps BI pipelines and any kind of BI analytics as well. So anything that is ultimately important for ML is super helpful, I should say, for BI and traditional analytics too.

And when it comes down to it, there has to be a clear correlation between features and metrics. If the way metrics are computed is very different from how features get computed, that's going to lead to challenges down the road in terms of explainability of models and the trust from the business in the data that's generating certain predictions. So at the end of the day, the space has to converge when it comes to data management overall, but that's probably going to take a long while to happen.
[00:51:29] Unknown:
And in your experience of building FeatureByte and working in this space of feature engineering and ML model development, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:42] Unknown:
I think one of the key things, and this was true even when my cofounder and I were at DataRobot, and it's true even now, is just how model-centric the world of AI is. Right? If you look at the number of tools that are out there for building models, doing hyperparameter tuning and experimentation, MLOps tools, etcetera, the whole space has evolved so much to help data scientists build and deploy really good models. But when you look at the first half of the ML life cycle, which is everything to do with data and data management, there is a serious dearth of tools out there. You've got some data labeling tools, and obviously the cloud data platforms that provide a lot of computational horsepower, but nothing that makes the overall process really straightforward, simple, and scalable.

And it just feels so lopsided. When you look at the data side of the equation, it's literally what it was back in perhaps 2015. Over the last almost decade, not much has truly changed and evolved. So it's something that, in my opinion, is ripe for disruption, ripe for major innovation in the space.
[00:53:16] Unknown:
For people who are interested in digging more into the concepts around feature engineering, feature development, feature platforms, what are some of the resources that you have found most helpful in understanding and designing those capabilities?
[00:53:31] Unknown:
Yeah. So for me personally, it's looking at what some of the large companies that have open sourced their projects have been able to do. There are open source projects like Feast and others; Feathr is one example coming from LinkedIn. They've done a really good job of putting feature stores in the limelight, so to say, and expressing the criticality of and the need for feature stores in the overall infrastructure. Outside of that, I look for inspiration from Kaggle grandmasters, and we've got two of them on our team, to understand the overall power and criticality of feature engineering: how it's helped them win amazing competitions, what they're able to do with the data, and how they're able to derive some really powerful signals from it. Talking to customers and understanding their needs is also super important for us as we're building these capabilities out at FeatureByte itself. And then, believe it or not, just going through social media, especially on LinkedIn, and looking at data engineers voicing their frustrations about dealing with feature pipelines and ML pipelines, and the machinations they have to go through in order to do things well.

Those are truly interesting inputs for us, or for any team, to consider as they're building these capabilities out. One other thing I was realizing
[00:55:26] Unknown:
I didn't touch on, and is maybe a conversation better suited for my other show about machine learning, is: for this work of feature engineering, are there particular categories of model architectures or model types, say deep learning versus linear regression, etcetera, where feature engineering is more useful? In particular, I'm wondering if deep learning workflows really leverage feature engineering and feature capabilities, versus just feeding the model a whole bunch of data and hoping something useful comes out the other side.
[00:56:02] Unknown:
Yeah. That's been a topic of conversation for at least a decade, if not more, and I'll give you an interesting anecdote. But to answer the question: feature engineering is absolutely necessary for tabular data. Deep learning models don't do a really good job of building really good models with tabular data. There isn't a lot of it available in open source, or in the open domain, for deep learning models to learn from or chew on. Deep learning models just have a much larger appetite for data than what most enterprises tend to have.

So, at least for now and for the foreseeable future, you need feature engineering to take raw data and create a representation of the real world that algorithms can learn from and do predictions on. And to your point, when it comes to deep learning models, you just feed in as much data as the algorithm can chew and as much compute as is available to you, and interesting things tend to come out of that, as we've seen more recently. The anecdote I was going to share is that I had the really good fortune of having a quick chat with Yann LeCun at a breakfast in Boston, and I was asking him about feature engineering and what he thinks about the general space.

And his comment was really interesting. He said, well, I've spent the last two decades of my professional career trying to get rid of feature engineering, but when it comes to tabular data, there is really no option out there. So, I mean, that's how the world is going to be, at least for now. And so even for companies who are thinking, okay, do we need feature engineering, especially with the Gen AI and the LLMs out there, how should we be thinking about it? For any kind of predictive problem, especially predictive problems that involve tabular data, you ultimately need a combination of features and embeddings, potentially from unstructured data: textual data, NLP, information that may exist. And you need to combine both of those in order to build a really powerful machine learning model.
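As a sketch of that last point, here is one way engineered tabular features and text embeddings can be combined into a single model input; the embedding function is a stand-in for whatever real model would produce them, and all of the values are invented for illustration.

```python
import numpy as np

def embed_text(text, dim=4):
    """Stand-in for a real embedding model (e.g. a sentence transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

# Engineered tabular features, e.g. the distance and trailing-spend values
# from the earlier sketches.
tabular_features = np.array([4485.0, 20.0, 0.2])

# Embedding of an unstructured text field attached to the same event.
note_embedding = embed_text("disputed overseas charge")

# One combined input vector for the model.
model_input = np.concatenate([tabular_features, note_embedding])
print(model_input.shape)  # (7,)
```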
[00:58:47] Unknown:
Are there any other aspects of the overall space of feature engineering, feature development, feature platforms that we didn't discuss yet that you would like to cover before we close out the show?
[00:58:58] Unknown:
I think we've touched upon quite a bit. So Alright.
[00:59:02] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:18] Unknown:
Tobias, my response to that question is obviously going to be a little biased, but I'll try my best to put it in more general terms. Right? When you look at the modern data stack, at pretty much any image that represents all the different layers of the modern data stack, there's all kinds of interesting tooling available for traditional BI and analytics. You have ELT that makes ingestion super easy. You've got tools like dbt for transforming your data for analytics. You've got all kinds of interesting analytics and visualization tools. It all looks very neat and tidy and pretty well layered.

One of the things that's missing is ML tools and platforms. I rarely come across a picture or an image of the modern data stack that has the ML side of the equation well integrated into the overall map. And when you try to draw that out, the connection between the modern data stack, the cloud data platform, and your ML tooling is very squiggly and a mess, because it's all manually set up and managed. It's the wild west in some ways: every organization does things very differently. For enterprises that are serious about scaling, that part of the equation absolutely needs to change, and it has to change ASAP in many ways. Right? I'll go so far as to say that not having a self-service data platform for data scientists is one of the biggest hurdles in being able to scale enterprise AI, and it's a badly needed extension to the modern data stack.

So from my perspective, that's one of the biggest, or one of the widest, data management challenges that I see out there that just needs to be fixed. And, obviously, the modern data stack allows the ability to do that very effectively.
[01:01:33] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on the overall process and workflows and requirements for feature engineering and the capabilities that it enables. It's definitely a very interesting space. I appreciate all of the work that you and your team are putting into making that a bit more of a tractable problem. So thanks again for taking the time, and I hope you enjoy the rest of your day. Thank you very much for having me, Tobias. Really appreciate it.
[01:02:49] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to the Guest and Topic
Understanding Features in Machine Learning
Feature Engineering: Process and Importance
Feature Stores and Platforms
Lifecycle of a Feature
Challenges in Feature Engineering
Interfaces and Collaboration
Foundational Components for Feature Engineering
Common Mistakes and Governance
Collaboration Between BI and ML Teams
Lessons Learned and Future Directions
Closing Remarks