Summary
Data is a critical element of every role in an organization, which is also what makes managing it so challenging. With so many different opinions about which pieces of information are most important, how they need to be accessed, and what to do with them, many data projects are doomed to failure. In this episode Chris Bergh explains how taking an agile approach to delivering value can drive down the complexity that grows out of the varied needs of the business. Building a DataOps workflow that incorporates fast delivery of well-defined projects, continuous testing, and open lines of communication is a proven path to success.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- If DataOps sounds like the perfect antidote to your pipeline woes, DataKitchen is here to help. DataKitchen’s DataOps Platform automates and coordinates all the people, tools, and environments in your entire data analytics organization – everything from orchestration, testing and monitoring to development and deployment. In no time, you’ll reclaim control of your data pipelines so you can start delivering business value instantly, without errors. Go to dataengineeringpodcast.com/datakitchen today to learn more and thank them for supporting the show!
- Your host is Tobias Macey and today I’m welcoming back Chris Bergh to talk about ways that DataOps principles can help to reduce organizational complexity
Interview
- Introduction
- How did you get involved in the area of data management?
- How are typical data and analytic teams organized? What are their roles and structure?
- Can you start by giving an outline of the ways that complexity can manifest in a data organization?
- What are some of the contributing factors that generate this complexity?
- How does the size or scale of an organization and their data needs impact the segmentation of responsibilities and roles?
- How does this organizational complexity play out within a single team? For example between data engineers, data scientists, and production/operations?
- How do you approach the definition of useful interfaces between different roles or groups within an organization?
- What are your thoughts on the relationship between the multivariate complexities of data and analytics workflows and the software trend toward microservices as a means of addressing the challenges of organizational communication patterns in the software lifecycle?
- How does this organizational complexity play out between multiple teams?
- For example between centralized data team and line of business self service teams?
- Isn’t organizational complexity just ‘the way it is’? Is there any hope of getting out of meetings and inter-team conflict?
- What are some of the technical elements that are most impactful in reducing the time to delivery for different roles?
- What are some strategies that you have found to be useful for maintaining a connection to the business need throughout the different stages of the data lifecycle?
- What are some of the signs or symptoms of problematic complexity that individuals and organizations should keep an eye out for?
- What role can automated testing play in improving this process?
- How do the current set of tools contribute to the fragmentation of data workflows?
- Which set of technologies are most valuable in reducing complexity and fragmentation?
- What advice do you have for data engineers to help with addressing complexity in the data organization and the problems that it contributes to?
Contact Info
- @ChrisBergh on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- DataKitchen
- DataOps
- NASA Ames Research Center
- Excel
- Tableau
- Looker
- Alteryx
- Trifacta
- Paxata
- AutoML
- Informatica
- SAS
- Conway’s Law
- Random Forest
- K-Means Clustering
- GraphQL
- Microservices
- Intuit Superglue
- Amundsen
- Master Data Management
- Hadoop
- Great Expectations
- Observability
- Continuous Integration
- Continuous Delivery
- W. Edwards Deming
- The Joel Test
- Joel Spolsky
- DataOps Blog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. If DataOps sounds like the perfect antidote to your pipeline woes, DataKitchen is here to help. DataKitchen's DataOps platform automates and coordinates all the people, tools, and environments in your entire data analytics organization. Everything from orchestration, testing, and monitoring to development and deployment. In no time, you'll reclaim control of your data pipelines so you can start delivering business value instantly, without errors.
Go to dataengineeringpodcast.com/datakitchen today to learn more and thank them for supporting the show. Your host is Tobias Macey. And today, I'm welcoming back Chris Bergh from DataKitchen to talk about ways that DataOps principles can help to reduce organizational
[00:01:30] Unknown:
complexity. So, Chris, can you start by introducing yourself? Hi. My name is Chris Bergh. I'm CEO and head chef of a company called DataKitchen in Cambridge, Massachusetts, and we focus on a new area in data engineering and data science that we call DataOps. We have a software product, and I've been evangelizing the concept now for 5 or 6 years. And do you remember how you first got involved in the area of data management? Well, Tobias, I've been involved in technology for almost 30 years, and the first 15 years of my career were more software and algorithms. I worked at NASA Ames on a project to automate air traffic control. I worked in enterprise software, and on one of the first consumer websites that did recommendations and social chat. Then, around 2005, I left the world of writing code and managing people who wrote code, and jumped into the world of what we now call data engineering, data science, and data analytics. I worked for a company that did healthcare analytics, and I was the chief operating officer of that company. So I had people working for me who did what we now call data engineering, people who did BI and data analytics, and people who did data science; we just used different words for them 10 or 15 years ago. My life was trying to make the trains run on time, and I lived what I think of as the trio of pain of people who deal with data and analytics. The first pain comes from your data providers, who don't care about you and give you crappy data, and you've got to deal with that. The second pain comes from the customers of analytics. You can't just build them a house and walk away.
They constantly have a river of questions that you've got to answer. And the third pain, which is really the subject of this talk, is that the teams your teams have to relate to, whether that's production or other teams in the organization, make things complicated. Even just working within your own teams, having them relate to each other and coordinate, is complicated. So there's this organizational complexity that creates this trio of pains. My cofounders and I all have similar hybrid software, data, and analytics backgrounds; we're very technical, we've written a lot of code, and we've also managed people. We started this company, DataKitchen, about 6 years ago, when we realized that the set of pains we had lived through at our analytics company, LeapFrogRx, was a generalizable problem for people doing data and analytics. So we've been on a mission, and we call the solution to that pain DataOps. We've been talking about the concept of DataOps, we have a software product that embodies it, and we've built a company devoted to that mission. And so in the space of the data
[00:04:13] Unknown:
responsibilities within an organization, depending on the size of the group and the specific tasks that they're trying to perform, you listed a number of different responsibilities in terms of data engineers, data analysts, BI engineers, and data scientists. And some organizations might break that down even further, into roles like a person responsible for data governance.
[00:04:34] Unknown:
And I'm wondering what you see as some of the typical ways that those data and analytic teams are organized, and some of the specifics of the roles and the structure there. I wish there were a typical way. There are a lot of varied ways, and it even comes down to the naming. Data engineers used to be called DBAs; I used to call them ETL engineers; some people call them data plumbers. I think one of the great improvements in the last 5 years is the elevation of data engineering to a full-fledged profession, and that's fantastic. Their job is mainly to take data and put it in a place, in a meaningful format, for others to use. Then there's another group of people we call data scientists, or some people call them AI people; they apply algorithms to that data. And there's another group I usually call people who do data visualization. Sometimes they're called analysts or people who do BI, and they take the data and try to explain it visually or by other means to business customers. And then, as you said, there are governance and security people. So you've got data engineering, data science, data visualization, and data governance as the main components, plus the managers associated with them. There are lots of other roles you could put in there: you could say a statistician is a form of data scientist, and there are managers and scrum masters. But generally there are these 4 main buckets of roles that people have in organizations. Now the question is, where do those 4 main buckets live in a company? There's a role that has really increased in number over the past 5 or 7 years called a chief data officer or chief analytics officer.
And he or she is in charge of what some people call data offense and data defense: trying to actually get value out of the data in an organization, and making sure that the data in the organization doesn't hurt it. In some organizations, the data engineers and scientists and visualization and governance people all work for that person, and he or she is in charge of the data, accessing the data, and putting it into systems that deliver value. That's sometimes the case: they all work for one person.
But it's actually not always the case, because sometimes these roles end up in different parts of the company. Sometimes the data engineers, along with the data sources they draw from, like the CRM systems or ERP systems where the data comes from, live broadly under information technology, which is generally headed up by a person called the CIO, or chief information officer. And sometimes the people who are actually trying to take that data and deliver value out of it for the business don't even live in IT. They live in what's called the line of business, and maybe they work for a marketing team or a sales team or a CFO. So they're not on the same team. And sometimes your data science team is so new that it may not fit anywhere; maybe they work for the CEO. So if you imagine these 4 main groups, the different roles sometimes have affinities. Data engineering is sometimes closer to IT, and governance is sometimes closer to IT. And with the rise of self-service tools, it's becoming more and more common that the people who deliver visualizations or explanations of data are mapped toward the business side, the line-of-business side. But there are counterexamples to all of it.
There are complete line-of-business teams who do everything, all 4 major roles. You could take these 4 roles and the ways companies are organized and pretty much randomly distribute them, and you'd probably find somebody who's got that organizational model. But I do think that over the next 10 years, more and more companies are saying, look, we want to be data driven, we want to get insight out of data, and it's probably better to have the whole factory and value chain of data under one person, who is then in charge of making sure that we're getting value out of our data and that the team is set up for it. I've talked to teams of 40, 50, 400, 500, a thousand people who are all trying to get value out of data. But that may not always be the case. And one of the things that seems to contribute to this variety in where the different roles and responsibilities might sit, and their reporting hierarchies, is that a lot of the types of work being done provide value in different areas of the business, but there's also still a lot of
[00:08:52] Unknown:
unknowns throughout the business as to what the capabilities and needs of those different roles are, and the best way to structure them, because they're such new positions. So data science: there have been statisticians for ages, but the newer capabilities with things like deep learning and machine learning are viewed as sort of magical solutions to everything. And so maybe the CEO wants to have a direct report who's able to answer all of their questions without necessarily fully comprehending the dependency chain behind them. And I'm wondering what you see as some of the common points of confusion or lack of understanding about how the different stages of data play out, and how that contributes to some of these elements of complexity and how it manifests within the organization?
[00:09:42] Unknown:
Yeah. There are a couple of answers to that. For all the listeners to this podcast, and you and I, we have this strange gift. I don't know what percentage of the population, maybe 5 or 10 percent, are these high abstractors who can look at data, who want to spend 4 hours or a day digging into it, and who can then express it in a way that others understand. But a lot of people don't have the patience, or they're just not interested, or don't have the talent for sifting information, and they're looking for someone to do that work for them. They want to trust that person to give them an insight that makes business sense. Your average business leader is getting better, and there's this term data literacy, but it's pretty rare to find someone at the top levels of companies who really spends a lot of time pivoting around data, or who knows the difference between kinds of clustering or segmentation algorithms. So we have these teams of people trying to put this together, and we have these strange skills where we can abstract things in data or in code and get value out of it. One aspect of that is, since we're a small part of the population, we think we know what people want. But I've come to learn that's actually not true. We don't really know what our business customers want, or what the bigger picture is, because it takes so much work and effort to wrest insight out of data and wrestle with complicated infrastructure that we think we know what they want, but we actually don't. So one of the biggest precautions, I think, is that you have to be humble and believe that you don't really know what the heck you're doing.
The only way you can do it is to get feedback from that other 90 percent of humanity: put something in front of them, a file, a dashboard, something that may not be perfect, and say, what do you think? And then have them guide you to where the next step of value is. So one part is that those of us in the cult of abstraction who can do the work have to be really humble about our ability to know what our customers want. Just because we took a lot of calculus in college doesn't necessarily mean we're smart. There's a lot of bigger-picture and individual business decision making that we have to become cognizant of. And then the other interesting factor in all this, in addition to iteration and agility, is that the tools people have, the ones we've created in the market, have grown into, I would say, 2 buckets. There's a set of tools that make it pretty easy to do stuff with data. That's not only Excel; there's a whole bunch of what analysts call self-service tools, like Tableau and Looker, where you can pivot around, get insight from a good dataset, and then push it to production with a button. There's a whole set of data prep tools, like Alteryx, Trifacta, and Paxata, where you can do a lot of data work from a nice visual UI with some algorithms behind it; it's data engineering for non data engineers. There's even a whole set of self-service data science tools out there that try to do what's called AutoML. And then, of course, there are the more technical tools backed by code or visual UIs, like Informatica or SAS, or programming languages and SQL, and all that.
So among the classes of tools, there's this new class of self-service tools that enable people to do a lot of work. And there are also a lot of consulting companies that have come along in this boom area. So if I'm a business person, I'm like, I want to get some insight: do I talk to my IT people? Do I hire some people to use some of these self-service tools? Do I call up a consulting company and pay them? It's a complicated world if you just put yourself in the shoes of, say, a VP of something or other who hears this data thing is important and wants to get some of it. And I think that also makes a challenge in terms of how organizations react to data. Yeah. The introduction of self-service
[00:13:48] Unknown:
in particular, I think, contributes to some of this area of complexity, because there's a strong drive to be able to get answers from something, but people don't necessarily want to have to understand all of the different stages that happened before they started asking questions of a particular dataset. So they might not traverse the entire lineage of a set of data to understand all of the additional pieces of context that come along with it. And so it pushes a lot more of the upfront responsibility of providing that information, as a readily available piece of the overall data and the way that it's presented, onto the data engineers and the data analysts and everybody who is working to get the information to a point where the self-service tools can have access. And I'm wondering what your thoughts are on some of the ways that that, and other elements within a business, contribute to the organizational complexity of data responsibilities.
[00:14:47] Unknown:
Well, if we think of these roles, data engineers, data scientists, data visualization people, and governance, as 4 categories, then I think the work that they do can be expressed in pipelines, or maybe meta pipelines. They're not just data pipelines, but data science pipelines, data visualization pipelines, and data governance pipelines. Those pipelines are made up of the different tools that people use. It may be Tableau and Trifacta in one, Alation in another, Python and SQL in a third. So, through that metaphor, there are pipelines all over the organization, and they're all processing chunks of what happens with the data. If you take that model, pipelines in the broadest sense, where everything is a pipeline, even self-service tools, even governance, then the process of getting value to your customers is almost Lego-like: connecting these pipelines together. In some cases they're whole and complete; an IT team can centralize a data warehouse and own the Cognos reports on it. But in other cases you'll have a data enablement team doing part of it with a data lake, another set of machine learning engineers with a feature development pipeline and a model, then some visualizations built on that batch model by yet another group, and finally the poor data governance person trying to pull all the pieces into relation at the end. So these pipelines are distributed across the organization.
So if you take that simple model, there's an org structure, and these teams with their different roles each own and work on pipelines.
And one aspect is that the organizational structure the people work in is reflected in those pipelines. What is that reminiscent of? Conway's law, the observation from the sixties that the structure of an organization is reflected in the structure of the technology it builds. When I was a software developer, if you said, oh, that code is built that way because of Conway's law, that was not a compliment. It was something bad, because you built the interfaces in your code not to represent the optimal way the code should be structured, but just because some VPs in a room divided up your group that way. And I think that's also true in the data and analytics pipeline world: there are these pipelines, people control their pipelines, and you have to connect them together to get business value to your customer. And that creates a really weird perspective for your business customer, because most nontechnical people don't understand the discontinuity between why something is easy and why something is hard. Why can changing a line chart to a bar chart be done right in front of me, but actually adding a new data file takes 4 months? It drives them crazy. They don't understand the discontinuities, and I've spent years trying to explain this to people, and it's hard, because you have to understand how the thing works. Part of the reality is that the complexity goes up depending on what level and what pipeline a change goes into, so it's sometimes easier to do things quicker at a higher layer. So imagine a scenario.
I've got a VP of sales, and they've got some dashboards, and behind those dashboards some models and some segmentation of their customers. Now, that segmentation could be created from a data science model: you could run a random forest on the data, get a complicated series of if-then-elses, and apply that to do the segmentation of your customers. Or maybe there's some other algorithm you want, like a k-means clustering, but you'll still end up with something that's kind of like an if-then-else. Or maybe you just go into Tableau and create the segmentation as an if-then-else. Or maybe your data engineer creates an attribute of a dimension in a Kimball-like fact table.
Or maybe somebody with access to the data writes an if-then-else in Python. So where that if-then-else lives, what layer of the stack or what pipeline it sits in, is actually really relevant to the design. Often, to generalize things, the logic should be lower in the stack, because more people depend on it. But the cost of doing that also plays out. So maybe it's just easier to make that if-then-else tweak in Tableau, give it to the VP of marketing, and say, what do you think? Does this make sense? And he or she can say yes or no, and maybe you solve the problem. So there are these data pipelines, but there's also the logic, the code or configuration, that lives in all these diverse pipelines. And that in and of itself is an area that DataOps talks quite a bit about. So organizations have different roles. Those different roles create pipelines.
Those pipelines live in different areas of the organization. And then the design decision about where the interesting work, the if-then-else, the logic, happens is, I think, sometimes not consciously made, and that can create challenges like the ones Conway's law describes. Yeah. The communication
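The "where does the if-then-else live" point can be made concrete. Below is a minimal, hypothetical Python sketch (the field names and thresholds are invented for illustration, not from the episode) of a segmentation rule that might start life as a quick calculated field in Tableau and later "earn its way down" into a warehouse dimension attribute once the business confirms it's right:

```python
# Hypothetical customer-segmentation rule. The same if-then-else could live in
# a Tableau calculated field, a Python script, or a Kimball-style dimension;
# expressing it as one function makes the choice of layer a deployment
# decision rather than a rewrite.

def segment_customer(annual_spend: float, orders_per_year: int) -> str:
    """Assign a customer to a segment (illustrative thresholds only)."""
    if annual_spend > 10_000 and orders_per_year >= 12:
        return "champion"
    elif annual_spend > 10_000:
        return "big_spender"
    elif orders_per_year >= 12:
        return "frequent_buyer"
    else:
        return "occasional"

# A tiny made-up batch of customers, standing in for a fact table.
customers = [
    {"id": 1, "annual_spend": 15_000, "orders_per_year": 20},
    {"id": 2, "annual_spend": 12_000, "orders_per_year": 3},
    {"id": 3, "annual_spend": 800,    "orders_per_year": 15},
    {"id": 4, "annual_spend": 200,    "orders_per_year": 1},
]

# Applied at the data-prep layer: attach the segment as a derived column,
# which is exactly what a dimension attribute would encode lower in the stack.
for c in customers:
    c["segment"] = segment_customer(c["annual_spend"], c["orders_per_year"])
```

The design choice Chris describes is then visible: while the rule is still being tweaked with the VP, keep it high in the stack where iteration is cheap; once it stabilizes, push the same logic down so everyone shares one definition instead of three.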
[00:20:14] Unknown:
between pipelines and between members of the organization is definitely something that can very easily get reflected in the ways that data propagates, because you have somebody who asks for a particular piece of information, and you say, okay, I'm going to add it into this report, or I'm going to collect it and say, okay, this pipeline is complete, here's the handoff to the next stage, whether that's data engineering collecting all the data, loading it into an S3 data lake or into a data warehouse, and then handing it off to data scientists to run their analysis. And another element is that a lot of the work being done in these pipelines is optimized for moving forward, but it can be difficult to have useful bits of information propagate back to form feedback cycles, to determine: is this pipeline working effectively? Is it providing all the information that's necessary? By the time data goes from the source through maybe 5 different levels of pipelines to the business end user, maybe the CEO, who then says, okay, I've got another question, or this doesn't quite answer the question I had, you've already spent a lot of your lead time getting everything to that point, and you don't necessarily have the time to go all the way back to the start, optimize all the pipelines, and get the same answer back out again. And I'm wondering what your thoughts are on some useful ways to create those feedback cycles and shorten that iteration, and on useful interfaces for designing those communication patterns, both from a technical and an organizational perspective.
[00:21:54] Unknown:
Yeah. It's a really that's an important question. Right? Because, like, if you think of the this case where there's, you know, everything sort of a pipeline as a metaphor. Right? And there's data engineering, data science pipelines, and data visualization pipelines. And there's, you know, if you're doing trying to do something good for your CEO. Right? And and, oh, it's the CEO. Let's engineer it right. Let's put that at the lowest level in the stack so everyone can use use from it. And that that's probably a good idea, right, from a technical standpoint. But from a is it actually valuable? Right? And you don't know if the CEO likes it or not. He or she may wanna tweak it and tweak it again and tweak it again until there's it is right. And so putting at a lowest level of stack and taking months to get it out could be exactly the wrong thing. And so it's really it it is really unknown about how much work it takes to actually say that you should do something at enterprise scale or at the lowest level of the stack. And this may be heresy to people who believe in enterprise architecture, but I think it's better to for things to to try to, have. If you're gonna invest in things from the sent center out, it should earn its way down instead of abstractly being able to get it there from the beginning. And so if you're gonna do an f n else, do it in, you know, do it in, the as quick as possible in the data prep layer or the data visualization layer and see if it's right. And then when it gets when it earns the right to, be useful, then have it go back down further on, into the lowest level because of that fan out problem, because of the the the centralization versus freedom problem. And so the other part is that the org structure may not permit that. You may actually not be able to sit and and say, okay. CEO wants x. Where are we gonna put the if and else? Is it gonna be in Tableau, or am I gonna model it, or the data science gonna data engineer's gonna do it? 
You may not even be able to have that conversation, because no one owns the relationship.
There may not be a CDO who can sit down and say, okay, I know what the CEO wants, this is really important, we're going to do it at the lowest level; or, we're just going to get it out quick the next day, could you just flip it up in Tableau? These design discussions end up almost being territorial ones. I've been in discussions where very mature people I respect, data architects, say, I only do enterprise architecture features, and they weren't interested in anything that didn't have that more fundamental character. And I can see their point. Then I've sat in rooms with people who are really, really good at Tableau, who think those enterprise people are complete idiots, and who are focused on iterating quickly.
But then they end up wondering, well, why do we have three definitions of market basket across the company, or eight definitions of the same term? Everyone's wrong and everyone's right in this equation. It really is a fundamental tension in analytics, this relationship between the centralization function and the distributed function, and how those pipelines are managed. And then, fundamentally, how do you collaborate? How do you connect all these things together to make it right? One way says you should have a lot of meetings and a lot of documentation, and checklists in spreadsheets so that things don't go wrong, for when that CEO gets the if-then-else and two weeks later it breaks. And that's not unreasonable. No one likes to get yelled at by their customer when things go wrong, and data is a business where things will always go wrong, because it's a factory-like system. So that's one way: you just start having more meetings. And then once you get more senior, or you become one of the people who are smart, you get pulled into more and more meetings, because you've got more and more of the picture of the whole system in your head. Pretty soon half, or nine tenths, of your time, and you're the most valuable person in the company, is spent in meetings, lining things up, reviewing code, making sure everything's right. Then you get frustrated: this sucks, I'm going to go get a new job. And that most talented person walks out the door. That's just not a great situation, because then no one has the whole pipeline and organizational data map in their head anymore.
I think that's a very common circumstance, because of this centralization versus freedom, self-service versus home-office IT type of world. It gets further complicated because another player in this world is the operations team. If you talk about where that if-then-else goes, and in which one of your pipelines, every one of those pipelines has a development stage and a production stage. When you take something and put it into production, there are different paths. In Tableau, I can click a button and it's live. Many organizations have a dev, a QA, a UAT, and a production environment, and it can take them three to six months to take twenty lines of SQL from one stage to the other. So it's not only the organizational complexity of the people who are creating it and where they live in the organization, but also the production team who runs it. It's so much nicer in software, where we've got this nice one-to-one relationship between dev and ops: I've got my ops team, I've got my development team, a front-end group and a back-end group, maybe a nice well-defined GraphQL interface between them. That's fantastic. But in data you've got a many-to-many relationship, because there's not just one ops team. Sometimes there are multiple ops teams: a central one, line-of-business ops, and development teams running their own ops. That becomes another challenge, because one of the main common threads across anyone who does data science, engineering, or visualization is they want to do new, cool stuff. They don't want to be fixing data problems on Saturday. They want someone else to run operations, or they want to run it hands-off. Yeah, I can definitely
[00:27:57] Unknown:
sympathize with not wanting to solve those problems on Saturday, as somebody who does work in operations. And I take your point about the relative simplicity of the software development lifecycle: you write your code, you put it through your CI pipeline, and at some point it makes it out into your QA environment and eventually to production. I'm wondering what your thoughts are on the relationship between the current trend of breaking software applications down into microservices, to simplify and reflect some of the organizational communication patterns, and the viability of a similar approach in the data and analytics space, given the layered complexity of the different pipelines, in terms of the code, the data, and the organizational requirements on top of that. Yeah, it's interesting, because if you look at the microservice
[00:28:50] Unknown:
model, monolith versus microservices: a microservice has the quote-unquote two-pizza team that does everything, all the way down from the UI to the database schema, and you can componentize them together. That's a great model: vertically integrated teams that can run. Then you've got the lateral communication challenges when you've got a bunch of microservices running. So both architectures have their mix. The monolith, with a back-end group, a front-end group, and a DevOps group, works; and then there's the other where you divide it up into services. In the data and analytics world it's a little bit different, because it's as if you had teams where parts of the same application were built differently: some teams using low-code tools for both back-end and front-end work, some teams using low-code tools for front-end work only.
There are some people using high-code tools with IDEs doing back-end work, and they're all working on this big application, and some of them don't even know the others exist. So there's no sense that they're in the same shared business together, even though literally, in most organizations, thirty to fifty percent of the people do something with data and analytics. Maybe they're a consumer, or, in more and more organizations, they're trying to actually get insight from data themselves. And they may sit in all sorts of different parts of the organization, where they do an export, get data, and integrate it with another dataset. So in some ways I see data and analytics ahead of software engineering in organizational complexity, or maybe behind. I don't know.
Because certainly software is going to get there too. There have been a lot of low-code tools like Mendix, a lot of SaaS products, partial integration tools. We're getting into a world where software itself is diversified, and the value chain behind an app could come from a low-code tool and a SaaS app and an enterprise app all woven together, and you've got to be able to do CI and CD on that; you've got to be able to deploy it. I think that same structure has already happened, is completely out of the gate, in the data and analytics world. This is the world: there are just so many different groups doing different pipelines in different organizations. And if you want to have a hope (my company is trying to get people to be able to deploy quickly, with low risk, safely), consider that most organizations now have a data warehouse team that has no idea, if they change the name of an existing column, what the damage is going to be. The only way they can find out is by going out, sending email, and talking to people.
And to me, that is a suboptimal way of handling collaboration. I've been a technical guy for freaking thirty years now, and I know my technical people. We don't like meetings, and we don't want to coordinate with other people. We want a button that says: if I change this, did other shit break? That's the button I want, and everyone wants that. So you need to empower people to have that button: okay, if I change a column's name, what's the effect on all my other pipelines, my data science pipelines, my data visualization pipelines, my data governance? When you've got all these pipelines floating all over the organization with no coherence to them, you have no idea of any of those effects. And what frustrates me is that people seem to think that's all right.
That's the way the world is; we just live with errors, and then we're reacting on Friday. And then you wonder why so many people who went in with high hopes, getting their data science master's degree, come out really frustrated; or why over fifty percent of data and analytics projects fail; or why the number of companies reporting that they're data-driven is actually going down. There are a lot of reasons, but it's not because we don't have cool tools. It's not because we can't run a database query faster or build some ETL code faster. The real reason is these people, process, and organizational challenges. If we don't face up to that and take it as a primary thing in our work, then I think we're going to be in a world of hurt. I think that's what happened in software. It was dev and ops, where we developers were throwing code to the poor operators, who we thought were lesser beings, back in the nineties. When I was a software engineering manager, we threw code over the fence and walked away, and the people who did ops were paid less than the cool job of developer. Twenty years forward, DevOps engineers make more than software engineers. And maybe that's the way it should be, because they're actually handling a big amount of complexity; or maybe it shouldn't be, I don't know, but at least there should be parity. That relationship between dev and ops is fundamental. And so the relationship between all these pipelines and their writers in the organization and their operations is fundamental. The many-to-many relationship is fundamental to how data and analytics teams work.
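That "did other shit break" button is, at bottom, impact analysis over a dependency graph of data assets. As a minimal sketch, with entirely hypothetical pipeline, column, and dashboard names, it could look something like this:

```python
from collections import deque

# Hypothetical dependency graph: each asset maps to the downstream
# assets (pipelines, models, dashboards) that consume it.
DEPENDENCIES = {
    "warehouse.orders.customer_id": ["model.churn", "etl.orders_daily"],
    "etl.orders_daily": ["dashboard.revenue"],
    "model.churn": ["dashboard.churn_report"],
    "dashboard.revenue": [],
    "dashboard.churn_report": [],
}

def impacted_by(asset: str) -> set[str]:
    """Breadth-first walk: everything downstream of a changed asset."""
    seen, queue = set(), deque([asset])
    while queue:
        for consumer in DEPENDENCIES.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# "If I rename this column, what breaks?"
print(sorted(impacted_by("warehouse.orders.customer_id")))
```

The hard part in practice is not the traversal but assembling the graph across Tableau, SQL, Python, and the rest; the sketch only shows the shape of the answer the button would give.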
If we don't solve that, we're just going to continue this process of techno-fetishism, where we believe the next algorithm or the next tool is going to actually make things work. And that's something I'm entirely sick of. I'm really sick of the belief that if I buy another tool, this will all just happen. I'm also sick of the BS from vendors; I go to conferences and see it all the time. So there's a set of things that we're not focused on.
Organizational complexity is one of them, and the way we push things to production, the operationalization of models and data engineering, is another. Those things, I think, are really the surface of attack that we've got to focus on. I'm trying to talk about these things and give a perspective on them, just having been around it; I've seen it so many times. It really is a mindset change that people have to go through, because normally you grow up as a technical person and think, oh, organizational stuff, roll your eyes, that's for managers, I don't have to deal with it. But the DevOps idea in software was that there was an organizational barrier that needed to be broken down. And the organizational barrier that needs to be broken down in data and analytics is not just dev and ops. It's the multiple-pipelines-all-over-the-place, multiple-tools, centralization versus freedom versus production problem. It's that many-to-many problem that we've got to solve. And then, secondary, or maybe at the same level, the deployment, CI/CD, testing, and environment management problems.
[00:35:35] Unknown:
And another element of the multiple pipelines that all of these different teams build is that a lot of the time they'll end up duplicating effort, sometimes with different context or different concepts of what it is they're trying to solve for. Then you're left trying to reconcile why there are so many different ways of getting at the same answer. We've tried to solve this through things like master data management, and there are tools like SuperGlue from Intuit for being able to see all the different steps that go into a given report, so you can say, okay, this report has this value, or this report didn't update; what are the different stages that happened before it, so I can go back through and try to debug the problem?
So there are some approaches from a technical and tooling perspective to address the issue of surfacing a canonical set of data, or the canonical way of doing things, or saying, I've already done this, you don't need to do it again: things like data discovery with the Amundsen tool. I'm wondering what your thoughts are on the design problems in the tools and technical systems we use that contribute to some of these organizational and operational challenges of duplicated effort, or the lack of visibility into the data that's available across the organization?
[00:36:55] Unknown:
I believe the center of the problem lies in the processing of the data, not the data itself. The processing of the data is a code- and configuration-driven system. In every one of these pipelines there is code and configuration running. Maybe that code runs in a Java VM, or in a Hadoop cluster, or in a database, or in some ETL engine or visualization tool, or it just runs in Python. But it's all code, and that, to me, is the perspective we need to focus on: sharing and reusing and centralizing and distributing. First of all, it needs to be in Git, and it needs to be versioned, so I know what versions of my code went into producing this stuff and can understand it. And then the second part is, how do you promote reuse and sharing in a big analytics system where you've got these pipelines all over the organization?
Well, first of all, the common situation now is that these pipelines are not all centralized and they're not all in Git. Some of them are in shared files, somebody's using Git, and there are all these rat-trap tools out there that say, you live in my system, and I'm going to have my own Git, and he's going to have his thing, and she's going to have her own Git. So there's no there there: there's no place to know where your processing lives. And that, number one, seems entirely insane to me, because if you're a data-driven organization, the IP of being data-driven is the processing of the data, not the data itself. As an example, why do hedge funds keep that stuff under lock and key? Because the processing of the data is their value. So we need to centralize that and put it in Git, or in multiple Git repositories; just get it somewhere people can find it. Then the second part is, how do you actually create usable components and get people to refactor their code? First of all, it's recognizing that it's code. What happens in a lot of organizations is that code is mailed: I mail a SQL script or some Python script, or here's a Tableau workbook; or I pick one up from a share drive, rename it, and start editing from there. So if we just think of the world as a bunch of if-then-elses, an if-then-else in one Tableau workbook starts spreading, like a virus, to a bunch of other notebooks.
And, well, do you know where that if-then-else is? How many notebooks is it in? Is it right? Did someone change it? Maybe that if-then-else feeds one of the things the CEO looks at. So you need to refactor, and think of the work you do in data and analytics like software: you're in the complexity business. You have to make time, in every unit of work you do, whether you call it a sprint or a project, to improve what you've already done: refactoring, making libraries, keeping things from becoming big hairballs. It's not only organizational complexity; it's code complexity. So libraries, refactoring, reusable components, shareable components all need to be in the discussion when you're building these complicated value chains that happen to have data flowing through them. I think a lot of people give too much weight to data, and they have the notion, mistaken or not, that data is somehow different: oh, the data is big, or the data is complicated. It doesn't matter if the data is big or small, structured or unstructured, pictures or nice rows and columns. You still have systems in which data is being acted upon, guided by code. That's what we need to focus on, more than worrying about the data. And that tool you mentioned, where I can trace precedence across my report and see where all the lineage came from: in most organizations, that's just a complete pipe dream. It's never going to happen, because the data has gone through four different systems, with four different languages acting upon it.
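The refactoring move here is the ordinary software one: pull the copy-pasted if-then-else out into one shared, versioned definition. A trivial illustration, with an invented business rule and invented thresholds:

```python
# One shared, versioned definition of a (hypothetical) business rule,
# instead of the same if/else pasted into every workbook and notebook.
def customer_segment(total_spend: float) -> str:
    """The kind of if-then-else that otherwise spreads, slightly
    differently each time, across Tableau workbooks and scripts."""
    if total_spend >= 10_000:
        return "enterprise"
    elif total_spend >= 1_000:
        return "mid-market"
    else:
        return "smb"

# Every pipeline, model, and report imports this one definition, so a
# term like "segment" means one thing company-wide instead of eight.
print([customer_segment(s) for s in (50_000, 2_500, 100)])
```

The point is not the function body but that it lives in one library, in Git, where a change to it is visible and testable.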
You'd at least have a chance at it if you had all your code in one place: to produce this data, here is the code that acted on it, and here is where the result came from. But the data-centric view, I think, is mistaken. We live in a best-of-breed, distributed-systems world, and data systems are among the more complicated distributed systems out there, because data goes through so many places and so many transformations, hopping from system to system. We've got to recognize that this is a distributed-systems problem, and it's getting more so with the cloud and with ephemeral infrastructure that you can turn on and off. So the grand-inquisitor boot on the neck, the one system that's going to rule them all (I think I just mixed metaphors there, Lord of the Rings and The Brothers Karamazov, sorry about that), that sort of thing is not going to happen. People want to try it, and there are very smart, very creative, very interesting people in data and analytics, and all these cool new tools coming up. But the common denominator across every one of these cool new tools is that they all take code and configuration to drive them. That's where people need to focus their time: which is the right tool for the job, how can I make reusable components that I can share and understand, and how can I see the effect of a change in one of those components on the larger system? Because it's a network of these pipelines that we're talking about.
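One way to make "which code produced this data" answerable, once the code lives in one place, is to stamp every output with a version of the code that ran. A rough sketch, with made-up file names; in a real setup you would simply record the Git commit SHA rather than hashing files yourself:

```python
import hashlib

def code_version(source_files: dict[str, str]) -> str:
    """Derive a stable version id from the pipeline's code and config.
    (Stand-in for recording the Git commit SHA of the repo.)"""
    digest = hashlib.sha256()
    for name in sorted(source_files):  # sorted, so dict order is irrelevant
        digest.update(name.encode())
        digest.update(source_files[name].encode())
    return digest.hexdigest()[:12]

def run_pipeline(rows, source_files):
    # Toy transform standing in for a real pipeline step.
    output = [{**r, "total": r["qty"] * r["price"]} for r in rows]
    # Stamp the result with the code version that produced it.
    meta = {"code_version": code_version(source_files), "row_count": len(output)}
    return output, meta

rows = [{"qty": 2, "price": 5.0}, {"qty": 1, "price": 3.0}]
files = {"transform.sql": "SELECT qty * price AS total ...", "job.py": "..."}
data, meta = run_pipeline(rows, files)
print(meta)
```

With that stamp carried alongside every dataset, "what produced this number" becomes a lookup instead of an archaeology project.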
So if I change the if-then-else, which I happen to have written in Scala, which changes the schema, which means the data is going to be different, which means the model and the reports are going to be affected, I need to know that ahead of time. It's absurd that you have to have meetings to do that, or that you're just waiting for your customers to find the problem. How DevOps-y and agile is that? The passivity of data engineers and data people, waiting for things to break, I find very frustrating, because there is a better way. You don't have to take the abuse of, oh, it was working and suddenly it's wrong, just because you can't conceive of another way of living. If anything, that's what gets me up in the morning; and again, I must be working from home too much. Data engineers and scientists and people who do data visualization need to reclaim control of their destiny. They can't live in the specificity of, I only focus on my little thing, and then it's done, and I don't know what happens in production, I don't know if people are using it, I don't know if it breaks; oh yeah, I'm going to get calls on Saturday to go fix stuff, and that's just my lot in life. That is entirely too passive a way of working. We all, as an industry, need to reclaim control, stop the techno-fetishism, stop the way of thinking that data is different, and accept that we're a complexity organization that has to deal with the complexity and the code as front-and-center things. What are your thoughts on things such as Great Expectations
[00:43:56] Unknown:
or other means of creating these dynamic assertions on the data as it traverses systems, as a way of regaining that control and visibility throughout the organization?
[00:44:08] Unknown:
Oh yeah, Great Expectations is great. I love automated testing in production; it's hugely important. Because, first of all, you run a factory. That factory consists of bunches of pipelines that have been placed there by different organizational structures, so they reflect Conway's law, and at the end of the factory is your customer. The input of the factory is raw data, and each one of those pipelines is processing it based on the entirely arbitrary organizational structure that people happen to have. So how do you know that a pipeline worked, or that the raw material going into it is actually right? You need tests, whether you call them assertions or something else. There's a whole nomenclature for running checks on top of production: software people call it observability, or monitoring, or some people call it testing. But it's on production, to make sure that when a data item goes into those base-level pipelines and follows its way through the system of pipelines the organization has decided on, what you end up with at the end is right, and you're not going to be woken up on Saturday morning or pulled out of the soccer match. One guy I met at a conference said it was his kid's birthday party, and he was sitting on the bathtub fixing pipeline code that Saturday because something was wrong. There's just got to be a better way, and testing in production helps ensure you don't have those problems, because you're alerted sooner. Then the other direction of testing is the more software-like testing: regression and functional and unit testing that happens not on production with production data, but in development on development data.
That's a whole level of testing too. But the great win here is the intersection of those sets: the production observability and monitoring tests happen when your data is changing but your code is fixed, while in development your data is fixed but your code varies. And those two sets of automated tests, however you group them, are largely the same Venn diagram. You should always spend twenty percent of your time doing automated testing, and if you're not, you're in a world of hurt. We're building these high-code systems, and the test-to-code ratio is so low that if you're any kind of software manager, you just pull your hair out when you see it. I jumped ship from software development in 2005, but I've been part of software since, because the companies I've worked for have built software. So, yes, I think Great Expectations is great.
There are other companies just starting to focus on observability and production monitoring. But that duality of testing, monitoring in production plus doing what software developers do in development, is super essential. You won't get any of the benefits without it, and it has to be across every pipeline in the organization. Because how else are you going to tell? The organization's got pipelines all over the place. Say you have some software that spans all those pipelines across the organization (my company happens to build one of those, or you could build one yourself), and I change something in the back end. Those pipelines, of course, are built in different systems, with different engines that they run in, so the only way to know is to have automated tests on top. You need tests across every bit of technology to make sure the system actually works, because you need a button to push. You want the button that says: if I change this, will things screw up? And you need that button on every pipeline, across all the organizational structures people have. That's what's really going to do it. It's not about meetings, and it's not about let's hold hands and have really good Jira ticket tracking. It's the same idea that came in with CI/CD: if I can push things to production, if I can do continuous delivery, if my twenty-two-year-old who just got out of college can make a change in my million-line web app and actually deploy it to production because they're empowered, that's fantastic. How many data organizations actually benchmark themselves and think: I have a twenty-two-year-old data engineer; am I giving them a button that says push to production?
I mean, does anyone who listens to your show have that case now, and know it's going to work because they've been able to test it, because they haven't broken any models or visualizations or governance on top of it? That's actually a fairly rare
[00:48:31] Unknown:
company. And I think that's one of the things we should change. And anybody who does have that magical button, please let me know so that we can have you on the show.
[00:48:41] Unknown:
But I don't think it's magical. I actually think it's necessary, and I think it's a change in perception, because we have to build a system next to the system that we're building that does that work. That is work the smarter people in the organization should do. Satya Nadella, the great leader of Microsoft who really turned the company around, has a quote along the lines of: if a developer has to choose between working on a feature and working on developer productivity, they should choose developer productivity. Tesla talks about how it's not about the machine; it's about the machine that makes the machine, the factory. Google has thousands of people who work on productivity. And even Deming said that when there's a problem, it's very rarely the special cause, a person being lazy; it's usually the process that the team lives in. We've relegated the process that these data and analytics teams live in to a purely human realm, and I'm saying you need to build systems to instantiate that process, and that is part of your job. That is the productivity-enhancing thing, and as a CEO, it's often one of the most important things you can do. You should have your smarter people working on that infrastructure for productivity, that infrastructure for low errors, that infrastructure for collaboration, and that's going to pay untold dividends. That's why DevOps engineers are paid more, and why no serious software project now lacks a CI/CD pipeline or automated monitoring and observability in production. There's the Joel Test, right? Joel on Software wrote this list: is your stuff in source control? Can you make a build in one step?
Your developer box shouldn't take three weeks to get set up. We need a list like that for data, and people have to have a different perception of what's valuable. We've got to get rid of, my value comes from the fact that I've tweaked my UI or my model. The value comes from the fact that we're building systems that people need to work in, and it's really important that those systems be well engineered. Those well-engineered operational systems that people work in are what I have called DataOps, and there's a bunch of stuff in there, and a bunch of companies focused on it. Some people call it MLOps or ModelOps; some people call it observability for data science; some people say it's CI/CD for data engineering. The terms are all out there, but it's a mental focus that really changed for software engineers. It really changed for me in the 2000s, when I thought release engineering and operations were for lesser beings. That changed; the relevance and the importance really rose. That's what I think the DataOps idea is about, and that's why a lot of people are interested in it: it's a different perspective on the world. We need to build really strong, really well-engineered systems that allow us to do good data engineering and science and visualization.
So that systems approach, that engineering approach to putting together these complex distributed systems across an organization, across the diversity of people and the diversity of data, I think is really important.
[00:52:03] Tobias Macey:
One of the core elements of DevOps in particular that you've touched on a few times is maintaining the focus on the business value that you're producing, not just on the technical widgets you happen to be churning out. That's a big part of the complexity arising in these data organizations as well: it's very easy to get deep in the weeds of tuning the distributed system for performance or making sure that your pipeline is reliable and has high uptime. But if all you're doing is delivering bits that nobody actually looks at, then you're just wasting your time. So what have you found to be some of the most useful strategies for maintaining a connection to that business value and the business need throughout the different stages of the data life cycle, and for propagating that information throughout the entire organization and across the data responsibilities that touch all of those different pipelines?
[00:52:53] Chris Bergh:
Well, I'm a believer in being data driven. The pipelines across an organization, the people building those pipelines, and the people operating them need to be measured. You need reports, algorithms, and a data flow off those systems to be able to understand whether they're working; that's how you judge whether your system as a whole is effective. Part of those reports could be a net promoter score for your reports, or a view score of who's looking at them, but they could also be as simple as your test coverage. Do you have tests across your whole system? What's your deploy rate? How are you judging the productivity of your team: how many feature points are they completing, what's their rate of change on a particular project? You can also look at whether they're on time: are they meeting their SLA? And one that's hugely important, I think, is error rate. Look at the lean manufacturing, Toyota idea of loving your errors: counting your errors, putting together plans to remediate them, and saying, okay, we made an error, let's fix it so it doesn't happen again. Get away from a shame-and-blame culture toward a more error-loving one; maybe you call it lean, or total quality management, there are lots of terms for it. So if you're going to have an effect on the customer, you need to iterate quickly, because you don't know what the customer wants. But in order to iterate quickly, you've got to build a system that observes how your team is working, and you've got to measure what your team is working on.
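The metrics listed here can be boiled down into a one-page dashboard. The run log below is invented for illustration; this is only a sketch of how a team might compute deploy rate, test pass rate, SLA adherence, and error rate from a record of pipeline runs:

```python
# A hypothetical log of pipeline runs; in practice this would come from an
# orchestrator's or CI system's history, not a hand-written list.
runs = [
    {"deployed": True,  "tests_passed": 10, "tests_total": 10, "on_time": True,  "errors": 0},
    {"deployed": True,  "tests_passed": 9,  "tests_total": 10, "on_time": False, "errors": 1},
    {"deployed": False, "tests_passed": 10, "tests_total": 10, "on_time": True,  "errors": 0},
    {"deployed": True,  "tests_passed": 8,  "tests_total": 10, "on_time": True,  "errors": 2},
]

def team_dashboard(runs):
    """Summarize operational health: the one-page view a manager could show."""
    n = len(runs)
    return {
        "deploy_rate": sum(r["deployed"] for r in runs) / n,
        "test_pass_rate": sum(r["tests_passed"] for r in runs)
                          / sum(r["tests_total"] for r in runs),
        "sla_adherence": sum(r["on_time"] for r in runs) / n,
        "errors_per_run": sum(r["errors"] for r in runs) / n,
    }

dashboard = team_dashboard(runs)
```

The specific fields are assumptions; the point is that the same team that builds analytics for the business can compute a handful of honest numbers about its own process.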
And I find it somewhat hypocritical that the people who lead data and analytics organizations don't run those organizations with any data and analytics. Where's the report that tells me how my team is doing? I wouldn't hire a software engineering manager who couldn't describe what's in his or her one-page dashboard for running a software engineering team; that's expected. So what's the equivalent report a data leader shows the CEO? I find those operational metrics really important, not only for teams to prove their worth in data and analytics, but also as a way for people to overcome some of these organizational problems. Because if you're a data person and you see data, you're going to react to it. Instead of being the Hatfields and the McCoys, arguing "you're wrong, we should do it our way," it becomes: here's the data. Look, we've had ten errors, and here's where they came from. What's the number one error we can fix? Was it a data quality problem in our source system, was some action taken on the data wrong, did a server go down, was it just late? What's the error, and how can we remediate it? Reflecting on and using a dataset, having a shared experience of what happened with our data process instead of anecdotes and side comments, I think is important. As a data person, you should believe in data, so why aren't we getting data about what's happening with our data? That can be very helpful for people in overcoming these organizational complexities, as well as the technical complexities, because you can start to see and understand what's happening and have a shared common experience.
And that shared common experience can then be used as a basis to overcome people's natural reluctance to admit that something is screwed up, or that they screwed up, and that they have to improve.
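The "love your errors" practice described here can be as simple as tallying errors by cause and fixing the biggest one first. The error log and category names below are made up for illustration:

```python
# Count production errors by cause so the team argues from data, not anecdotes.
# The causes here (late data, schema change, server down) are hypothetical.
from collections import Counter

error_log = [
    "late_source_data", "schema_change", "late_source_data",
    "server_down", "late_source_data", "schema_change",
]

counts = Counter(error_log)
top_error, top_count = counts.most_common(1)[0]
# The most frequent cause is the natural first target for remediation.
```

A shared tally like this turns "whose fault was it?" into "which cause do we eliminate next?", which is the cultural shift the lean and total quality management traditions describe.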
[00:56:31] Tobias Macey:
Are there any other aspects of organizational complexity or technical complexity in data organizations that we didn't discuss yet that you'd like to cover before we close out the show, or any last pieces of advice for people working in the data engineering space?
[00:56:42] Chris Bergh:
No, I thank you for the opportunity. It's been fun. I feel like I was talking too much, Tobias, and didn't give you a chance to ask your questions, but I got going today. I think there's a real opportunity for us to reframe how we work together, if we focus on building that machine that makes the machine. I think that's a really interesting challenge. And if anything, that's what's happening in the world now with COVID-19 and the coming recession: we're all going to have to face the challenge of doing more with less, because we're not going to get speculative projects anymore.
And so my fear in data and analytics is that we've gone through a boom cycle and are going to go through a bust cycle, because people aren't going to see the value in analytics. The only way I see to fix that bust cycle is to focus on operational efficiency. We can't just invent another two-letter acronym; we've had ML and AI, and "big data" as a buzzword, and the industry keeps inventing the next one. We've really got to focus on effectiveness, doing more with less, because we're in a recession and we're going to be in one for a while. More and more teams in the data and analytics field are going to hear: we've got 50 or 100 people doing data and analytics, what's the value? You mean I can only update my data warehouse every three months? We've still got consulting teams doing this, or contracts where we're paying people tens of millions of dollars a year to give us reports. So the yield of our data and analytics investment, looked at as a whole, is what we're going to have to focus on over the next five years, because that's just what happens in recessionary cycles. I think that's actually a good thing, and it's an opportunity for people to reframe how they think about the work: okay, I've done machine learning, I've done some algorithms, I've done some data science, I've done some viz, so what's next? The next thing is to focus on the effectiveness of the system, which means applying these DataOps principles to it.

Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today.
Well, in my area, the DataOps area, and maybe it's self-interest to say this, I actually see more small companies working on individual pieces of the DataOps equation, whether that's deployment, production monitoring, data privacy, or test data management. I think there's a wealth of opportunity there, because that operational area isn't about databases or ETL tools or viz tools or data science tools; it's about how you operate that whole system, with complicated data, in a complicated organization. There's a lot of runway there. Look at the amount of innovation that's happened in the greater DevOps market in software over the last ten years, between open source projects, closed source projects, and consulting companies. There are just a lot of things you can do and ways you can approach it, whether you're growing your career or learning the technology. We have a DataOps blog where we've written about the DataOps enterprise software industry; there are a bunch of companies there, closed source and open source, with different ideas, and I think it's going to be a really cool next five years watching the industry emerge. I just talked to a company this morning that does what they call DataSecOps, which is applying security principles to the data being transformed in a DataOps-driven process. So there's a lot of cool stuff going on that gives people a chance to learn and to get involved.
[01:00:35] Tobias Macey:
Well, thank you very much for taking the time today to join me and share your experiences and thoughts on the ways that DataOps can help improve the organizational and operational aspects of the data pipelines that we're building and running. It's definitely an interesting area and a challenging one, and, as you mentioned, one that we need to spend a lot more of our focus on. So I appreciate all of your time and efforts on that front, and I hope you enjoy the rest of your day.

Alright. Thank you very much, Tobias. Have a great day.

You as well.

Thanks for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Guest Introduction: Chris Bergh from DataKitchen
The Trio of Pain in Data Management
Roles and Responsibilities in Data Teams
Challenges in Data and Analytics Teams
Organizational Complexity in Data Responsibilities
Communication and Feedback in Data Pipelines
Microservices and Data Analytics
Design Problems in Data Tools
Automated Testing and Observability
Maintaining Focus on Business Value
Final Thoughts and Advice