Repeatable Patterns For Designing Data Platforms And When To Customize Them

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality.

PostHog is your all-in-one product analytics suite, including product analysis, user funnels, feature flags, and experimentation, and it's open source, so you can host it yourself or let them do it for you.

You have full control over your data, and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms.

Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog.

Your host is Tobias Macey, and today I'm interviewing Brandon Beidel about his data platform journey at Red Ventures. So Brandon, can you start by introducing yourself?

Hi, Tobias. Yeah. It's a pleasure to chat with you. So I have been at Red Ventures for just under 4 years. In my time there, I've been focused on machine learning problems, and now I'm ultimately focused on building out a data platform for part of the business.

Prior to Red Ventures, I had roles across the distribution sector and health care spaces, both as an analyst and a software engineer, and I kind of pull all of those experiences together working on the data platform today.

And do you remember how you first got started working with data?

Yeah. So my first experience working with data was at a distribution firm where I was working as a software engineer.

I didn't realize they were machine learning problems at the time. I thought they were just, you know, combinatorial problems. We were doing truck routing. Getting experience with those simulations of packing trucks and directing routes was my first introduction to data and to building applications around it, and it's been a long journey from there, building systems that solve complex problems like that.

In terms of the work that you're doing at Red Ventures, you mentioned you've been there for about 4 years now. I'm wondering if you can talk a bit about the overarching mission that you have, the specific role that you have there, and maybe what it is about that particular organization and the problems that it's trying to tackle that keeps you interested?

Red Ventures, for some broad context, is a collection of more than 100 global online brands.

They have a loosely cohesive mission, which is to try to connect consumers with products and services they might need. Some you've probably heard of: Bankrate, CNET, Healthline, The Points Guy. Those are the big brands that we own and operate.

So where I work is actually on an internal team that works as a marketing agency on behalf of our clients.

So we take some of the expertise we've developed on those big brands, and we help other companies optimize their marketing funnels, their page testing,

and just make their digital marketing experience kind of more cohesive and performant.

Where I plug in has been a couple of areas, first on paid search optimization. For my first two or three years at Red Ventures, I was exclusively focused on using machine learning to bid optimally on paid search problems.

And now I have kind of evolved my responsibilities into trying to deliver some of these data-platform-based capabilities to our clients in our organization.

In terms of the specific title that you have, I understand that you have some formulation of data product manager as your role. And I'm wondering if you can give some context on where you draw inspiration and direction for how to approach your work, given the fact that that is still a relatively new, emerging, and kind of loosely defined job title?

Since it's a role that bridges so many functions,

I find myself seeking inspiration from

podcasts, Slack channels, books, newsletters,

anything from data engineering, software engineering, product management, and even just mental models and decision making.

And I kind of use these things as a filter function.

I don't use Twitter. I don't use a lot of social media. I try to use, like, human-curated sources so that I can get a little more signal relative to the noise.

Do I miss out on a Twitter thread here or there? Sure. But for my mental health and for just the time spent doing it, I think it's a much better approach for me.

Absolutely. I can definitely agree about the time lost to the echo chambers, and I agree with your sentiment of avoiding those channels as much as possible.

And then in terms of the types of data products that you're working with, you said that you're working for

an internal team that focuses on engagement with a number of the different brands that the overarching

organization

owns and operates. And I'm curious

if you can give some kind of broad categories of what types of data products you and your data consumers are building and relying on?

I probably should have clarified. We actually are working on behalf of external clients. So we have, like, consumer product companies, banking companies that come to Red Ventures and say, hey, we want you to run our marketing.

But for those companies,

we align ourselves by the functional areas of our business. So we have paid media teams who are focused on the optimization and automation of

ad targeting.

So we have products associated with that. We have SEO and earned media, where we have teams focused on page speed, page rank, getting the content out for backlinks, and things like that. And then we have customer experience teams who are focused on on-site behavioral analytics.

You know, if I serve content A versus content B on our page, what is the expected output? They're running those A/B tests, analyzing them, and moving the ball forward on the on-site experiences.

For being able to power those

use cases and help those external customers, what are the types of

data sources that you're working with and the categories of data and the forms that it takes to be able to power those use cases and build out those analyses?

So for paid media,

a lot of it is coming right back out of those platforms. You know, we need to understand what we spent on a given platform at a given time with given targeting. We pull that back out for our paid media teams.

SEO, we're using a lot of our on-site analytics to

understand how things are performing. We're also pulling things in from third party sources in terms of how we're ranking against our competitors.

And then ultimately, the CE side is back again to this behavioral analytics. Think the Segments of the world, where we're pulling out page-click and page-view events and pulling those all together into meaningful chunks for our business team. So those are the 3 primary categories that fuel what we do.

And as far as the

sort of organizational

aspects of it, I'm curious if you can give some sense of the scale and composition

of the technical team that you're working with to be able to build out these

analytics and work with your customers and some of the ways that you structure the team to be able to handle these different types of engagements?

Red Ventures is 10 industry groups. We're one of those industry groups. Each of those 10 has its own data team, which is going to look entirely different. But our group is a 200-person team, including business folks. We have a 25-person data team, which includes

product managers, engineers, analysts, data scientists.

We tend to

align our teams with specific client engagements.

So if we have been contracted by a client to work on something for

2 years, we tend to allocate people specifically to that client, and then we have a few people like myself who float among them and provide that shared infrastructure that works to support all of those engagements.

As you're

working with these different organizations,

I'm curious how you've been able to scale that operation to be able to have some

maybe boilerplate or cookie cutter technical stacks or technical capabilities so that you're not building everything from scratch every time and having to do some sort of bespoke discovery work and rebuilding everything

and how you've been able to extract the common pieces that all of the clients are actually going to need, but make it modular enough so that you're not trying to force a round peg through a square hole.

Yeah. I think, luckily, we've learned a lot in our use of paid media about how we want to structure and organize things. So we have a lot of lessons learned from Red Ventures as a whole. This is how we wanna structure our campaigns. This is the typical reporting structure, and we've kind of reused those lessons over time.

As we move into other paid media channels, we're kind of building those playbooks as we go along.

But that governance structure is actually a really big part of what our team does: thinking about what the criteria are for the data elements we need before we even engage with the client. What is that checklist of data requirements? If we don't have these data requirements, we're not willing to even start the engagement. So that's how we protect ourselves a little bit from getting too far off the rails in any one direction or another.

When you're working with these

external customers,

are you typically

building out the technical stack for them, or are they coming to you with infrastructure already in place, and you're working to adapt your practices to fit with their engineering teams and the infrastructure that's in place, helping them just level up their capacity for being able to do data collection,

transformations,

analytics, things like that?

So we're building it on their behalf, typically from scratch with, obviously, some of those reusable components. It'll be a scenario where we spin up a dedicated warehouse and data collection infrastructure and data integrations with external sources on their behalf, often with the intent that after the conclusion of our contract, we hand that back over to their team after we've trained them on the operations, how to leverage it, and how to make the most use out of what we've done.

Because of the fact that you aren't the ultimate owner of this infrastructure and these platforms, I'm very curious to understand your thoughts on the build versus buy decision and how that might be influenced by the specific engagement. You might say, well, this is the baseline recommendation: we're just going to buy everything, we're going to go with vendor-managed systems. But if there's a specific constraint, then we're going to swap that out for a self-hosted open source option or anything along those lines?

We try to be very deliberate about those decisions, primarily for that very reason: if I have to hand something back over to a customer and all of a sudden they don't want that tool, that can be a real problem.

So before we really start to even engage with vendors, we try to have a very deliberate list of answers to questions like: What's the impact of having this tool? Who's the ultimate customer? What's their problem? How do we know when we've solved that problem? And what are the characteristics that they need to have to solve that problem? Sometimes we can have a shared tool that we use across all of our clients and actually use across all of Red Ventures.

Other times, we really need to be very specific, and they have a bespoke need that we need to match.

So that can manifest in a bunch of different ways. For instance, when we were looking at observability tools,

we were worried about ease of integration,

getting something up and running quickly for our businesses,

and then having isolated workspaces

because Company A's data should never touch Company B's data. We're not gonna put it all in the same place, and we should also be able to separate the alerting. So we looked at a bunch of different options with those criteria in mind, and ultimately we ended up looking at Monte Carlo and thinking that was the best choice for us. So that customer need and the requirement of being very separate was a prerequisite for us to really engage with a lot of these vendors and informed a lot of our build versus buy decision.

As you were

moving into this role of being a data product manager, working with all of these different

organizations and external customers,

what was your

background going into it as far as familiarity

with the overall landscape of vendors, technical platforms, and open source offerings and all the different ways that they integrate together, and how much of it did you have to just learn in the process of actually trying to build out these various systems?

Yeah. A lot of it is a building-the-car-as-we're-driving-it sort of scenario.

I really try to come back to that, you know, what is the customer problem? If we have a real customer issue, then we'll start to dig in a particular area.

For instance, right now, we have a customer with very intense

PII considerations.

And so now I'm spending a lot of my time looking into the available tooling to help do data masking

and to make sure that we can have clean rooms where people can do analysis and join on keys but not necessarily expose those keys. Six months ago, I had no knowledge of that area whatsoever, but the need arose, and we needed to dig in and really figure it out.

As you are exploring these new areas, and taking the space of data privacy and data security as an example since it's fresh in your mind, how do you determine that you have learned enough, and what is your process for being able to actually do a detailed enough comparison between different tools and vendors to make an appropriate decision, to say that this is actually the right tool and it's going to be the right tool going forward, versus this is what I was able to get working in the matter of a week, so we're just gonna go with that one?

Yeah. This is where I try to borrow from economics a lot and think about opportunity cost. If we were doing something else, what could we be doing with our time? If a tool takes 100 hours to implement and we can do something else in 20, what can we do with those incremental 80 hours? Then switching costs: if we make this decision, how hard is it to undo the decision? If we choose the wrong vendor and it's a 5-year contract, that's a tough pill to swallow.

And then really the total cost of ownership, not just in terms of the software costs, but in terms of hours of maintenance, education, governance, building out metrics, just wrapping your arms around the whole thing and making sure that you're not just a buyer of software, but a steward of it, and you're really taking care of it and getting the full value you can have. That's really how I try to frame these things, on different time horizons.

What are the costs and benefits of that product likely to be, in a very Bayesian way? Researching those products is gathering evidence so that we can have more confidence that we are making the right choice.
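As a rough illustration of the cost framing in this answer, the Python sketch below compares two candidate tools on incremental hours (opportunity cost) and total cost of ownership over a contract term. Only the 100-hour versus 20-hour figures come from the conversation; the `ToolOption` fields, the tool names, the license costs, the upkeep hours, and the hourly rate are hypothetical placeholders, not numbers from Red Ventures.

```python
from dataclasses import dataclass


@dataclass
class ToolOption:
    """One candidate tool. All fields are hypothetical inputs for illustration."""
    name: str
    implementation_hours: float   # one-time effort to get it running
    annual_license_cost: float    # software cost per year
    annual_upkeep_hours: float    # maintenance, education, governance
    contract_years: int           # how long the decision is hard to undo


def incremental_hours(slower: ToolOption, faster: ToolOption) -> float:
    """Opportunity cost in hours: time freed up by picking the faster option."""
    return slower.implementation_hours - faster.implementation_hours


def total_cost_of_ownership(option: ToolOption, hourly_rate: float) -> float:
    """Software cost plus people cost over the full contract term."""
    people_hours = (
        option.implementation_hours
        + option.annual_upkeep_hours * option.contract_years
    )
    license_cost = option.annual_license_cost * option.contract_years
    return license_cost + people_hours * hourly_rate


# The 100 vs. 20 implementation hours mirror the example in the conversation;
# every other number here is made up.
heavy = ToolOption("heavy_tool", 100, 20_000, 50, 5)
light = ToolOption("light_tool", 20, 30_000, 80, 1)

print("Hours freed by the faster option:", incremental_hours(heavy, light))
for option in (heavy, light):
    print(option.name, total_cost_of_ownership(option, hourly_rate=100.0))
```

The contract length stands in for switching cost here: a longer term multiplies both license and upkeep costs, which is one simple way to make a hard-to-undo decision show up in the comparison.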

To that point of the switching cost, one of the things that often plays into that is the specific interfaces that are used both to integrate into a given system and to implement a given system. So

SQL being kind of the most prominent example where a large number of databases

have SQL as an interface. And so you can know that, you know, with some caveats, if I select Snowflake today and I decide that that's not actually going to be the right solution after a while, I can switch to Presto or BigQuery.

I'll have to maybe remap some of my macros to use different function calls. But by and large, I'll be able to use the same SQL, and I don't have to retrain everybody on it. And

in the case of data privacy,

I know that a lot of systems are built on top of Apache Ranger, so I can use their policy definitions either for the open source offering or something like Privacera

or, you know, maybe there are some

user-defined functions that I can add to be able to kind of abstract across these. I'm curious what you're seeing as some of the most useful

sort of choke points or abstraction layers that make these decisions

easier to experiment with knowing that the downstream switching cost is going to be lower? And what are the layers of the data stack that you see as being still too disjoint and requiring a lot more upfront investment to be able to make a decision confidently?

Yeah. I would say 1 of those things that makes things a lot easier has been dbt.

Being able to define our infrastructure

in SQL and then take those same transformation definitions and drop them in any warehouse

has made a huge difference in terms of the way we approach our partnerships.

Back to that reusable components idea, I can define

a data product

and deploy it

to Redshift and BigQuery and Snowflake

all in the same day as long as they have that same base data. So that has been a real game changer for us. I think the places where it's a little bit harder to make those changes is once you start kind of consuming all of the products inside a particular

cloud ecosystem.

So the rest of the company uses

AWS almost exclusively.

And so for them to switch to use a GCP

or

an Azure,

their switching costs are a lot higher

than ours are. So I can see that the more you go beyond just using AWS, say by also using the identity masking tool inside of AWS, the stickier you make it. You make it harder to dislodge, because now you don't have to replace one thing, you have to replace two.

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy,

Zynga, and others.

Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies.

Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L.

In terms of the actual default platform that you've built, I'm wondering if you can give an overview of the specific tools and systems and the architectural and usage patterns that you've developed to be able to

quickly onboard new customers, build out an initial

platform to be able to iterate on, and then what are the components that you leave as a

secondary step to say, okay. We've done our initial discovery. We've built out the core capabilities. Now we need to layer on, you know, some additional privacy tools to make sure that we can do the PII masking and things like that.

The common front door for us is really the eventing. So this is a scenario where we're not gonna ask our customer to

recode their entire site to get their analytics out. So we very often just accept what product they have. If they have Segment, if they have Google Tag Manager, whatever their eventing system is, we're going to accept that. We just, you know, want to get those raw events. We use tools like Fivetran to get a common

ingestion of kind of paid media campaigns.

So that's kind of been an accelerant there on the integration side.

The warehouses, we've had contracts that have explicitly

said you must use 1 cloud provider or another. So very often, it's actually not the technical that decides that. It's actually the contractual that decides it. Things like dbt, we've been able to leverage across all partnerships, and that's been kind of an interchangeable component.

We have some instances where we've found Databricks to be a useful tool for the heavier Spark compute.

Monte Carlo is a tool that we actually use for data observability

across all of our partnerships.

And then two tools that we're kind of in the infancy of figuring out are Hightouch for data activation, actually integrating stuff back out to external platforms,

and then our orchestration platform. So we're leveraging

Airflow via Astronomer

to do all of that. So those are the primary components that we go with on a regular basis.

We do have some clients, like I said, that have a particular need, and they want a specific tool, let's say, for A/B testing, and we have to use that. And we'll work with them to make sure that we're meeting that contractual need.

In terms of the core building blocks that you've settled on, where you say these work for the 80 or 90% case, given the fact that the overall space is evolving so rapidly, I'm curious how often you revisit those choices and decide, well, okay, Fivetran works great right now, but maybe in 6 months we're actually going to move to Airbyte because they've got more of the long tail of integrations that we might need.

The way that

I have been trained to think about this is "and," not "or." In the case of Fivetran, for instance, when you talk about that long tail of integrations,

that switching cost of making everybody move off Fivetran can be very expensive, especially once you start talking about 10 independent teams working on it and switching these things out. That doesn't mean we can't still look at Airbyte for that long tail and have those things run in parallel.

So for us, sunsetting something is a very serious decision, but if we want to integrate something else and have this new addition meet a specific need that we've very explicitly stated, we'll go down that path and bring them both in. But to your original question,

we kind of line it up with the contractual, you know, years. So if something is on a 1-year contract, we'll typically, a few months before, look at usage, look at governance, look at metrics, and ask: are we getting the real value out of this product that we thought we were? If not,

then we start to have this conversation. Okay. What are the alternatives? Do these alternatives really meet the criteria we were looking for? And then we start talking about switching costs.

Given the fact that you are going to periodically revisit these

selections,

how do you document the original

choices and what the constraints were at that time so that you can

be

as effective as possible at the point where you are reconsidering it to say, okay. Well, these were the constraints at the time. These are my new constraints. So, therefore,

features a, b, and c are no longer relevant. We actually need features x, y, and z, and so we're actually going to switch to this other tool and just being able

to manage that historical record of

what the tool selection process was, what the ultimate decision was, and why, and then being able to bring that context forward,

particularly as new people come in and people leave and just being able to maintain that tribal knowledge.

Yeah. So we're very deliberate about the documentation of these decisions, to the point that when we have calls, we'll record them. When we have discussions, we'll record them. We'll have robust notes with matrices of: here are all the tools we looked at, here are the criteria.

You know, this matched 7 criteria, this matched 8. All the details we can throw at it before we make a decision are documented, and then we kind of store that away. And then when we want to revisit in a year, we have that available to start from again, so we're not starting fresh every time.
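One possible shape for the kind of decision matrix described in this answer is sketched below in Python. The criteria names are borrowed from the observability example earlier in the conversation; the tool names, the pass/fail results, the rationale text, and the date are hypothetical placeholders, and the record layout is an assumption rather than the team's actual format.

```python
import json
from datetime import date

# Criteria taken from the observability example earlier in the conversation.
CRITERIA = [
    "ease_of_integration",
    "quick_setup",
    "isolated_workspaces",
    "separate_alerting",
]

# Hypothetical tools and made-up results, for illustration only.
EVALUATIONS = {
    "tool_a": {"ease_of_integration": True, "quick_setup": True,
               "isolated_workspaces": True, "separate_alerting": True},
    "tool_b": {"ease_of_integration": True, "quick_setup": False,
               "isolated_workspaces": False, "separate_alerting": True},
}


def criteria_matched(evaluation: dict) -> int:
    """Count how many of the agreed criteria a tool satisfied."""
    return sum(1 for criterion in CRITERIA if evaluation.get(criterion, False))


def build_decision_record(chosen: str, rationale: str, decided_on: date) -> dict:
    """Bundle the matrix, the scores, and the reasoning into one storable record."""
    return {
        "decided_on": decided_on.isoformat(),
        "criteria": CRITERIA,
        "evaluations": EVALUATIONS,
        "scores": {tool: criteria_matched(result)
                   for tool, result in EVALUATIONS.items()},
        "chosen": chosen,
        "rationale": rationale,
    }


record = build_decision_record(
    chosen="tool_a",
    rationale="Only option with isolated workspaces per client.",
    decided_on=date(2022, 1, 1),  # placeholder date
)

# Serialized as JSON so it can be stored and reread at the next contract review.
print(json.dumps(record, indent=2))
```

Keeping the criteria alongside the scores means that a later review can tell which constraints applied at the time, rather than only which tool won.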

In terms of the multi-cloud aspect of the work that you're doing, where different customers may already have an existing workload on one of the different cloud providers, I'm curious how that influences your

decision making on different vendors or open source tools to say, well, actually, we're going to prefer tools that already work across cloud. So we're going to bias towards Snowflake because we can use it on AWS and Google, or maybe we say we're actually going to bias towards an open source solution

and go with something like ClickHouse or Presto because we can deploy that anywhere.

Yeah. It definitely

shows up in that kind of cost of ownership thinking. So we wanna think about not only our cost of implementation in terms of our time and our money, but also, when we hand it over to the client, what they want to use. We're balancing all of those preferences.

If we have a client who's already on BigQuery and that's what they're used to and giving them another BigQuery

instance isn't a big deal, like, you're gonna have to make a very strong case to break that preference because ultimately, like, they are the owner. They are the person who's gonna be making that decision. So I would say we

are consciously biased towards tools that have multi cloud capability,

be they open source or not. But, yeah, there's definitely a bias there on our part just because of that requirement to support multiple clouds.

In the time that you have been working

as the data product manager and working with these different customers,

given the fact that there have been so many new platforms and tools that have come about, I'm curious, what are some of the ways that the

design and focus of those technological and architectural choices have changed from when you first started working in this space to where you are today?

I don't know that our approach to the platform has changed, but our approach

to onboarding new tools has definitely evolved.

So we give a lot more weight now to training and governance.

Using an analogy here, you give somebody a set of power tools, you don't make them a construction worker.

So,

like, when we give these tools out, we need to make sure that they're invested in training, adoption,

best practices.

Otherwise, if we don't do that, they're gonna fall back to their old habits or just not leverage the capabilities at all. So, yeah, I don't know that our approach to tool selection has really changed over the last year, but I definitely give a lot more credit to the idea that we need to give out L&D. We need to drive adoption. We need to make sure that people understand what the guardrails are for using anything that we bring on.

And to that point of the

sort of organizational

and

social capacity

of being able to operate those tools,

one of the things that you recently wrote a post about and discussed is the effort of starting down the path of establishing

SLAs

for the different datasets and data products that you and your team manage. And I'm wondering if you can talk to

just the

overall process of discovering and understanding

the

requirements and the commitments that you're building

as you're creating those SLAs, understanding what are the useful data assets, what are the guarantees that we need to make, how do we actually go about

ensuring that we are maintaining those guarantees and just that whole journey?

Yes. That is very much an ongoing effort.

I think the most

relevant thing we've kinda learned in doing that has been keeping

the

conversations we have with

teams grounded in lived experiences.

The terminology of SLAs, SLIs, SLOs does not land with our business

teams, and so we need to make sure that we're focused on their lived experience.

We wanna understand things like

what happened the last time things went well. What did you do? What action did you take? What value did you drive? So we have that positive case. And then we really deliberately talk about the counterfactual.

When the wheels fell off, what happened? When were you impacted? What did you have to do? How did it get resolved? Who did you go to? And we're really kind of listening in an almost, like, pseudo therapy session

to help

determine

in the back of our mind what those SLIs and SLOs are. And then we come back later as a data and a product team and say, based on our conversation,

we wanna make these commitments to you. We're gonna check that this product is delivered at 8 AM every day. And if it's not delivered by 8 AM, we're going to send an alert. And then if it's not resolved by 10 AM, we're gonna open a ticket. And then if that ticket's not closed in an hour, we're gonna stop any new work, and we're gonna resolve it. So the evolution of our process has been very much focused and grounded in this business conversation of "tell me about when something went wrong" and listening for those who, what, when, where, why kind of answers around the data product and what the need actually is. And then it's our responsibility as a data team to materialize that as an automated check.

Going back to the sort of architectural elements and figuring out how that fits into the

technical capabilities of the team that you're working with,

it seems that you're very much biased towards these

event streams coming in, working in batch mode at the warehouse layer with transformations.

I'm wondering

how you have been thinking about the

different architectural patterns of starting to adopt stream processing and moving more towards real time sort of analytical capabilities

or moving from warehouse to a more lakehouse architecture to be able to be more

heterogeneous in terms of the data sources that you're working with, or

as you move further downstream into things like machine learning, how you look to

help improve the sophistication and capabilities of being able to run machine learning products in production and manage those life cycles and the sort of observability and operational considerations there?

Yeah. I think

where we're headed next is probably more on the machine learning side of things in terms of, like, democratizing some of those capabilities.

We don't necessarily control all of the on-site experiences, so we don't necessarily need to respond in real time to any particular instance. So streaming analytics isn't necessarily what we need to be focused on. But on the machine learning side, we actually use a lot of machine learning to help with our marketing selection. So

detecting

propensity to return, propensity to buy,

clustering potential users that we wanna target, those are all use cases that are very much part and parcel of what we wanna do for our clients.

So something that's on our road map for maybe the next 6 to 12 months is actually figuring out how we build more machine learning models to serve those purposes

without adding

a lot of data science headcount, without adding a lot of machine learning engineers,

and doing so with safeguards in place

to make sure we're not putting in machine learning models that are gonna do more harm than good.

In the process of

building out these

sort of practices for how to do vendor and tool selection,

how to establish and manage these SLAs,

figuring out what are the

cross cutting considerations that you want to apply everywhere versus allowing for customized

engagements.

What are some of the technical and organizational components of that data work that have proven to be most difficult?

I feel like

the challenge is rarely technical,

and the challenge is almost exclusively organizational

for us.

When we introduce a new capability,

we need to convince people that it's going to help them reach their objectives.

The pain of switching is worth it. There's going to be a gain on the other side, and we need to train them to be able to do it with confidence.

So

very rarely at this point has a customer need come up and a technology not existed for it, or we haven't been able to figure it out. Almost

all of our greatest challenges have been organizational in nature. And I think that's been probably the biggest lesson for me is

really reframing this technology

decision as it's never just buying the tool. It doesn't solve the problem on its own. The people are solving the problem, not the tool.

As you have been exploring this space and working with such a breadth of the ecosystem,

what are the things that concern you about the current state of data engineering and vendors,

and what are the things that you're most excited and optimistic about?

I think the thing that concerns me the most is vendors who think the value of their product is self evident. You may interact with someone

like yourself or myself who's got a deep technical background and understands the inner workings of the tool and the challenges,

but we may not be the person actually writing the check.

So these vendors that may come in and say,

hey. You know, you're gonna solve this problem. It's gonna be great. And then there's no follow-up for education, training, governance,

showing a value prop. That's been a concern for me in the space. The thing that makes me hopeful is really just the amount of democratization that these tools have provided.

In the last 5 to 10 years,

you can do a lot more as an individual contributor than you could have done

decades prior. You can build entire data stacks on your own with a few external vendors

that would have taken teams of people weeks to do, you know, a decade ago. So

on one hand, you just have everybody

trying to come in and sell you something, and they're not necessarily doing a great job of, like, showing you that value the whole way through and making sure you're actually extracting that value. On the other hand, there are just so many tools. They're so empowering.

It's really allowing teams to move fast.

Do you want to learn how the Joybird data team reduced their time spent building integrations and managing data pipelines by 93%?

Join a live webinar hosted by RudderStack on April 20th, where Joybird's director of analytics, Brett Chawney, will walk you through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible.

Go to rudderstack.com/joybird

to register today.

As you have been engaging with customers and helping them to

create their own analytical

capacity,

what are some of the pieces that have required the most, I guess, handholding or education?

And what are maybe the

pieces that have been most confusing or challenging for your customers to be able to

take over during that handoff process?

Yeah. I think some of the most challenging ones have been around

understanding

when machine learning is involved,

trying to get a sense for

the complexities

underneath the machine that are making decisions on our behalf, and people aren't necessarily wanting to kinda let go of control. They're used to doing this themselves. They're used to,

you know, taking it by hand through the whole process. And once you start introducing those sorts of automations and automated

rules, they're not necessarily

ready to kind of let go of control yet. So those are the places where it requires probably more coaching.

Training someone on BigQuery versus Snowflake versus Redshift, like you said before, it's SQL. The interface looks the same. It's just where you log in.

With those sorts of tools, it's pretty easy to transition from tool to tool. But the more complex, the more abstract, that's when teams need a little more help to get to the point where they understand and want to use the value.

Particularly in terms of that hand off and

moving ownership of the problem to the customer more fully,

one of the things that's always challenging there is ensuring that you have appropriate

kind of documentation

and visibility into what that whole end to end flow looks like. And I'm

curious how you've been managing that piece of the problem and making sure that as you're building out these solutions, you're actually

ensuring that there is that

cohesive view of what the whole system is doing as a unit versus being able to say, okay. Well, this is doing this piece very well, and then I'm going to jump over to this other thing that completely shifts context, and now everything is very disjoint.

Yeah. I don't think that's something we've mastered yet. It has very much been on the shoulders of our data teams to kind of know where all the pieces are, and our product team, who kind of interfaces across all of these industries. They really plug in when there are questions about these sorts of

integrations between systems and, you know, the water's edge for what someone's familiar with.

I think the way we're fighting against that is actually

kind of tool by tool and thinking about different personas

within that tool. So

for

a particular tool, I may only need to be a reader versus an editor versus an admin, and each of those levels

has different educational requirements where I need to teach them about

30 aspects of the tool versus 20 versus 10.

And so we actually take time as an internal team to deliver content

both in, like, live education

as well as documentation

and recorded videos.

So that way, when people have questions and they wanna go back and ask about how to create a Monte Carlo monitor, like, we have internal videos walking them through not just how to do it in general, but our specific application of doing so and how to meet our naming standards and how to route things appropriately to the correct channel. So we're thinking about it on a kind of a tool by tool basis, and we're trying to create content,

self serve content that guides people to the level of education that they need,

not everything at once. I'm not expecting them to know how to manage the cloud infrastructure for

the 3rd party tools we're leveraging. That is a small group of people that need to worry about that. But my business analysts, they have maybe 5 things they need to learn in each tool, and that is something that's much more manageable to kinda deliver to those teams.

And as you have been

working in this role and working with the

different members of your teams, I'm curious how you think about

organizational scalability

and ongoing

sustainability

of this work and how you're able to maybe, like, build these flywheels that make it easier for you to do more things faster. So you mentioned these sort of content libraries for your customers, how you're able to build those out in a way that they're kind of plug and play where you build it once and it's applicable to multiple different engagements versus

demonstrating the workflow for a specific customer and then having to redo virtually the same thing over for the next customer?

Yeah. So that's where it gets into the idea of being kind of a steward of the tool rather than just a, you know, a purchaser of it. We really try to push this onto our product team. So thinking about what are the best practices for that tool? How do we communicate those best practices in the library of content you're talking about? But then

also, how do we check that? What are the automated ways to go in and see that people are being adherent to it? We have a naming convention in a system. Can we extract out all of the records, you know, all of the rule naming

conventions, and then give me a list of all the ones that aren't adherent? And then somebody who's the owner of that tool is gonna go kind of out and resolve that issue.
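The automated check described here can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual tooling: the naming pattern (`<team>_<dataset>_<check>` in lowercase), the function name `find_non_adherent`, and the sample rule names are all assumptions made up for the example. In practice the list of names would come from an export of whichever tool is being governed.

```python
import re

# Hypothetical convention (an assumption for this sketch):
# <team>_<dataset>_<check>, all lowercase, separated by underscores.
NAMING_PATTERN = re.compile(r"[a-z]+_[a-z0-9]+_[a-z0-9]+")


def find_non_adherent(rule_names, pattern=NAMING_PATTERN):
    """Return the rule names that do not fully match the naming convention."""
    return [name for name in rule_names if not pattern.fullmatch(name)]


# Made-up names standing in for records extracted from a tool.
exported_rules = [
    "marketing_orders_freshness",
    "Finance-Revenue-Volume",
    "marketing_clicks_nulls",
    "tmp test monitor",
]

violations = find_non_adherent(exported_rules)
for name in violations:
    # The tool owner would follow up on each of these.
    print(f"needs rename: {name}")
```

Running a check like this on a schedule and routing the output to the tool's owner is one way to turn a documented standard into something that is actually enforced.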

So it's education.

It's governance. It's metrics. It's automation. It's, like, really getting to a point where, yes, we're handing tools to people, but we wanna make sure that the bumpers are on the lane and they're not misusing things

outside of our expected use cases.

As you have been working through these engagements, what are some of the most interesting or innovative or unexpected ways that you've seen teams building these data systems?

So we have a team of probably 2 or 3 business folks that have been

probably

the most impressive

group of folks I've worked with in a very long time.

People with

business backgrounds,

given tools like dbt and Monte Carlo and Airtable,

have been able to create what are the equivalent of CRUD applications

that manage

entire

marketing campaigns on multiple ad platforms.

So what would have taken

a team of engineers

a few weeks to do, you know, they might have gone down this path and they weren't sure there was any value.

We've been able to empower business teams

to

create that value, prove that it's there. And then when they kind of reach the limits of what those tools can provide, then we can go to the engineering team and say, this is kinda what we wanted to do, but we have these extra requirements that we can't quite figure out. Can you help us build it? So I've been really impressed by our business teams

and their ability to really adopt some of these tools and make kind of cool, combinatorial

things that have really

saved themselves time and and just been impressive for us as a whole.

On that

example of

being able to take some of the existing tools, add in something like Airtable, and get to 80% of the use case, one thing that I'm curious about is how often

the extra set of requirements,

while seemingly benign and sort of incremental on what you've already built, are actually going to require a significant amount

of reengineering

because it's something that is so divergent from what something like an Airtable or whatever provides. It puts me in mind of the xkcd comic about, you know, oh, I've got this app. I wanna be able to, you know, see where the user is when they take the picture. Okay. You know, that's an easy feature. And then I wanna be able to tell what kind of bird they took a picture of. Oh, that's gonna require billions of dollars and months of research.

And a team of PhDs.

Yeah. So

thinking about that, this gets back into the opportunity. They may have a requirement where this thing, this extra feature, this field validation to meet, you know, naming expectations

would be great to have.

You know, we need to have a frank conversation that it's gonna be 6 weeks, and we could be doing something else with all that time. And that's an active conversation we have with all of our teams. But to the point, we still have 80% of the value. We don't have 0.

80 is not ideal, but 80 is much better than 0.

So we really just try to be very frank with those teams and say, it's great what you've done. It's really impressive.

In order for us to go the next mile,

I don't think you're willing to pay the cost. So we're gonna have to stay where you are. We have to make that trade off.

In your own experience

of working in this space and building out these platforms and building up the team, what are the most interesting or unexpected or challenging lessons that you've learned in the process?

So I think

the challenging thing that I'm still learning today is how to

pick the right level of technical detail based on the audience.

If you're jumping from talking to senior executives, to engineers,

to

procurement, to legal all in a single day, you're talking about the same tool. If you

don't switch context and switch, like, the grain at which you're talking about,

you're gonna lose cycles of time. You're gonna spin your wheels. You're not gonna be as persuasive as you could be. I think that's something that in these roles where you're just at this confluence of a bunch of different

functions,

you need to be very good at switching

to the right level of detail.

In terms of the ongoing work that you're doing, I'm curious what you have planned for the near to medium term as far as areas of investigation or new systems to explore and develop. As part of that, I'm also interested in what you're seeing as the

most difficult layers of the stack to make informed decisions about.

Some of the areas that we're looking into

include

the democratization

of machine learning,

implementing some better patterns in terms of data privacy,

and allowing teams to do analysis in a privacy conscious way.

and, I think this is the biggest one, figuring out paid media measurement

in a post-cookie world. With IDFA changes, with things coming out of Google Tag Manager, a lot of digital marketing

is in for a rude awakening in terms of

what are you gonna do when you can't actually know if the person that clicked on your ad made a purchase? If you can't draw that line together, how do you value that? So that's

bordering on the existential crisis for us as a company and trying to figure that out. There are some techniques out there they're working through. That's probably one of the areas where we need to do the most learning. In terms of technology,

again,

yeah, I think that that's probably the list of things that we're looking to next.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes.

And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I think the biggest gap is probably

trying to answer the question of how much is this data worth. How valuable is it? We can talk about data being delayed. We can talk about it being misconfigured, but what is the dollar value of this table? So that way I know as a team,

it's the one to prioritize,

or it's the one we need to be protecting the most in terms of privacy and security risk, whatever those metrics are. And we talk about data being valuable, but we never actually put a number to it. And I think that's a huge gap for me in terms of being able to prioritize the right work.

Yeah. It's definitely an interesting view to it. And, also, it's not a static answer because this table is worth,

you know, $1,000,000 today because it drives our board for determining

how much inventory we wanna have on hand. But

6 months from now, it's also going to be driving a machine learning model that's actually bringing in extra revenue. So now it's worth $5,000,000. And so being able to have those feedforward and feedback mechanisms to understand, like, what does that metric compute to and how important is it at this point in time?
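One way to picture the "not a static answer" point is to treat a table's value as the sum of the use cases that depend on it at a given date. The sketch below is purely illustrative: the `UseCase` structure, the dates, and the split of the $5,000,000 figure into a $1,000,000 use case plus a $4,000,000 use case are assumptions, and estimating those dollar figures in the first place is exactly the unsolved part discussed here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UseCase:
    name: str
    estimated_value_usd: int  # assumed to be supplied by the business
    live_from: date           # first day the use case depends on the table


def table_value(use_cases, as_of):
    """Sum the estimated value of every use case that is live on `as_of`."""
    return sum(u.estimated_value_usd for u in use_cases if u.live_from <= as_of)


# Hypothetical dependants of a single table.
use_cases = [
    UseCase("inventory_planning_report", 1_000_000, date(2022, 1, 1)),
    UseCase("revenue_ml_model", 4_000_000, date(2022, 7, 1)),
]

value_today = table_value(use_cases, date(2022, 4, 1))
value_later = table_value(use_cases, date(2022, 10, 1))
print(value_today, value_later)
```

Because the result depends on `as_of`, the same table can move up or down a priority list as use cases are added or retired.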

Yeah. Exactly. It's a dynamic function we need to solve for, and I don't think there's a good way to do it yet.

Absolutely. That's definitely a complex problem and something probably worth a whole suite of PhDs on its own.

Exactly.

Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Red Ventures and Red Digital. It's definitely a very interesting problem space and interesting

view on the domain. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.

Appreciate that. You do the same.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
