Building Tools And Platforms For Data Analytics

Hello, and welcome to the Data Engineering podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the project you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.

With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai.

And for your machine learning workloads, they just announced dedicated CPU instances.

Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.

For even more opportunities to meet, listen, and learn from your peers, don't miss out on this year's conference season.

We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council.

Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona.

Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey. And today, I'm interviewing Ben Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts. And just full disclosure that Mode is a past sponsor of this show. So, Ben, could you start by introducing yourself?

So my name is Ben Stancil. I am one of the founders and chief analyst of Mode. Mode builds a product for data analysts and data scientists. So I'm responsible for both our internal analytics here at Mode as well as working a lot with our customers to help us better understand the needs that they have in the product and how we can better serve them as analysts and data scientists.

Prior to Mode, I worked on the analytics team at Yammer, which was a startup that was purchased by Microsoft in 2012.

And before that, my background is in economics and math. I actually worked for a think tank in DC for a few years, doing economic research, before landing in San Francisco and in the tech world.

And do you remember how you first got involved in the area of data management?

Yeah. So it was actually as a customer, really. I was working as an analyst at Yammer, which was my first job in tech, and I was really a customer of our data engineering team.

We used the tools that they built, as well as the data that they provided. Yammer was kind of one of these early leaders in the philosophy that engineers shouldn't build ETL pipelines, which is now something that's become a little bit more popular. There's a blog post from the folks over at Stitch Fix that talked about this very explicitly. But Yammer had the same philosophy. And so while I was there, we were responsible for building our own pipelines and for sort of dipping our toes into the data engineering and data management world.

And so that was kind of my first taste of it.

Then after leaving Yammer and starting Mode, which as I've mentioned is a product for data analysts and data scientists, I actually ended up taking kind of two further steps into data management.

First, I'm responsible for our own data infrastructure here at Mode. And so my role is to think about not just the analytics that we do, but how we actually get the data in the place that we wanna get it. But, also, in a lot of ways, Mode is providing the same service that the Yammer data engineering team was providing me when I was an analyst, which is that we are now building tools for other data scientists. So with the product that we provide, we very much have to think about how it fits into the data management ecosystem, and how it solves not just the problems that analysts and data scientists have, but the problems that data engineers have when they're trying to serve those customers.

And so you've mentioned that in your work at Mode, you're actually responsible for the data management piece and that you're working closely with other data engineering teams and other analysts to make sure that they are successful in their projects.

And I'm wondering if you can start by describing some of the main features that you are generally looking for in the tools that you use and some of the top level concerns that you're factoring into your work at Mode and the tool that you're providing to other people.

Yeah. So internally at Mode, one of the things that we really care about is that we wanna make it something that is easy to use for the analysts and data scientists who are actually consuming that data. So, again, to go back to the point from that Stitch Fix blog post, we really believe that the data scientists here at Mode should be responsible for as much data management as possible. There are a lot of great tools out there now, ETL tools or warehouse tools or pipeline tools, that analysts can manage pretty well. And you don't actually need someone to be a dedicated, capital-E engineer to really build out the initial phases of a pipeline.

And so for us, when we evaluate those tools internally, we wanna make sure that they are things that we can set up pretty easily, and that as customers of those tools who aren't necessarily fully fledged engineers ourselves, we still know how it works and can still make sure that it's up and running and performing the way we want. I think the analogy we often use with this is that it's like buying a car: you don't necessarily need to know the ins and outs of how the car works, but you need to know that it's reliable.

And if you learn to not trust the car, that it's not actually gonna drive when you want it to drive, you don't wanna actually learn how to fix the car. You just wanna buy a different car that actually works. And so when we're looking for tools ourselves, we tend to focus a lot on that: what's the experience like for the folks who are using it? Can we rely on it? And is it something that we need to have a dedicated person to run? Or is it something that we can kind of run in the background, and the analysts and data scientists can get it to work the way they want it to work?

The other thing I think that we really look for is usability. I think this is something that the folks who are building ETL tools and data pipeline tools often don't think about as much as perhaps they could, which is that the surface area of those tools isn't the application itself or the web interface.

I really think of the surface area of those tools as the data itself.

If I'm using an ETL tool, the way that I interact with that tool day in and day out is by actually interacting with the data that that tool is providing, not by logging into the web interface and checking the status of the pipelines and things like that. And so in those cases, little things matter. It ends up being column names that matter. Like, are there weird capitalization schemes, or are there periods in column names? Those little things that make it more frustrating to work with day in and day out end up being things that really drive our experience with those tools.

As for working with customers, our customers range from small startups to much larger enterprises.

I think for small startups, they often look like us.

For the large enterprises, the place that we really try to focus is making sure that the tools that we recommend are modular. The data stacks end up becoming very complicated. They end up having to serve a lot of different folks across a lot of different departments, pulling data from tons of different sources.

We try to avoid people focusing on, like, one tool to rule them all. Having one pipeline, one warehouse, one analytics tool, all of these things serving every need, sounds nice, but it's often very difficult to actually create that.

And we'd rather people be able to modularize different parts of their stack so that if something new comes along that they wanna use, they can easily swap something in and out without having to rearchitect the entire pipeline.

A couple of things jumped out at me as you were talking there. On one side, you're talking a bit about some of the hosted managed platform tools where anybody can just click a button and have something running in a few minutes.

And then on the other side of the equation, particularly if you have a very engineering heavy organization or a larger organization, you're probably going to have dedicated data engineers building things either by cobbling together different open source components or building something from scratch. And I'm curious what you have found to be the juxtaposition as far as the level of polish and consideration for user experience

of the analyst at the other side of the pipeline as you have worked with different teams and different tools that fall on either side of that dividing line of the managed service versus the build-your-own platform?

So for the managed service, it depends on the tool. Some tools do a really great job of this, some tools less so. I think that's probably true for any products in a similar space. Some products do a really great job of thinking about the experience for customers, and others are more focused on technical reliability or on other aspects of that product experience. For an example of a great tool that does some things really well but also has one of these pain points, take Snowflake. We're actually Snowflake customers. The database is a very powerful tool for us. We recommend it to most anybody, but, I believe in the tradition of the Oracle folks from which it came, all of their column names are automatically capitalized.
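
The column-name irritations that come up here (forced capitalization, periods in names) are the kind of thing a pipeline step could smooth over before data reaches analysts. What follows is a minimal sketch in Python, assuming the pipeline is able to rename columns on the way in. The function names `normalize_column_name` and `normalize_columns` are hypothetical and are not part of any tool mentioned in this episode.

```python
import re


def normalize_column_name(name: str) -> str:
    """Return a lowercase snake_case version of a column name (illustrative only)."""
    # Split camelCase boundaries: "createdAt" -> "created_At".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace periods, spaces, and any other non-alphanumeric runs with "_".
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Trim stray underscores and lowercase everything.
    return name.strip("_").lower()


def normalize_columns(names):
    """Normalize a list of names, failing loudly if two names collide."""
    result = [normalize_column_name(n) for n in names]
    if len(set(result)) != len(result):
        raise ValueError(f"normalized names collide: {result}")
    return result


if __name__ == "__main__":
    # Example input names are made up for illustration.
    print(normalize_columns(["USER_ID", "user.email", "createdAt", "Plan Name"]))
```

The collision check is a deliberate choice in this sketch: silently merging `a.b` and `a_b` into one column would be exactly the kind of surprise an analyst only discovers later.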

And so it's just one of those small irritations where it seems like when they developed it, it wasn't necessarily a consideration how analysts are gonna be interacting with this day in and day out, when all of your queries are constantly yelling at you because you have to capitalize everything. So little things like that, I think, are places where companies could think a little bit more about the ways that people use it.

From the perspective of internal tools and the folks who are building these from scratch, I actually think in a lot of cases those tools tend to be better in the ways that they think about user experience, because the people who are building them are sitting directly next to their customers. When they're a desk over from you, there's a really fast back and forth of, oh, how do you actually use this? You see someone use it every day. You hear them complain about it. They get the benefit that vendors don't get of literally working with their customer day in and day out, and of their customer being able to walk over to their desk and say, hey, this is a thing, can you change it? So while the internal tools often aren't as technically robust, often aren't as reliable in a lot of other respects, and aren't nearly as powerful and flexible, the small things often work a little bit better because they were custom built for exactly that audience.

For the things that you're talking about that contribute to just an overall better experience for the analyst, things like managing the capitalization of column names or preventing the insertion of dots in the column name that will break a SQL query, what are some of the guardrails that data engineers should be thinking about in terms of how their tools

are able to generate specific types of output or the overall ways that they can be thinking of

the end goal as they're building these different tools and pipelines that would make it easier for analysts and other people within the business to be able to actually make effective use of the end result of their work?

Yeah. I think it's being a product manager in a lot of ways. It's doing the research and knowing your customer. Those little things aren't things that are necessarily gonna be obvious, and it's very difficult to build a framework to show you exactly how those things will work. I think the best frameworks are the ones that product managers or designers build: how do you understand the needs of your customer, how do you engage with your customers and learn from them? Even as an analyst and as someone who lives in these tools day in and day out and is the customer of those in a lot of respects, I don't know that I could sit down and make a list of all the little things that I like or don't like. It's something that you very much realize as you're working on it, in the same way that I'm sure for tools that are built for engineers, they have opinions about how those should be built but don't necessarily have an ability to just write down the framework for understanding all these things. I think a lot of it is wanting to build something that your customers like and then taking the time to listen to them and understand what these pain points are that they're having and where they come from. Like, what are you trying to accomplish when you do it? I think a lot of it is the fundamental aspects of product management

and user research to really get at the core of what those problems are.

From the perspective of team dynamics or an organizational approach to managing these data projects, how much responsibility do you feel lies with the data engineering team to ensure that they're considering these end user requirements, such as the capitalization of schema names, when they're also concerned with making sure that they're able to obtain source data, that their pipelines are reliable and efficient, and they have maybe n plus 1 different considerations for the overall management of the data itself before it gets delivered? And how much of it is a matter of incorporating analysts in that overall workflow? Basically, I'm just trying to get the breakdown of where you see the level of understanding and responsibility for identifying and incorporating these UX points in the overall life cycle of building these products.

So generally, I would say the tool needs to work in the other respects, and I recognize that for a lot of the data engineers who are building these products, either internally or as vendors, there are lots of very complicated problems they have to work on. I think most analysts would recognize that as well. I think the responsibility doesn't necessarily lie with the engineers building these tools to go out and determine all this stuff on their own, without the help of their customers to tell them. I think the thing that is the responsibility of the tool builders is more just having empathy for the customer that's using it. It's less about, you know, I need to go figure out what these little things are, these usability issues or other things like that, that are gonna get in the way of my customer using this every day. They should be more just willing to listen when somebody has that feedback and recognize that those sorts of things are also things that will affect how somebody uses the tool that they built. So again, as with any product, I don't believe you can build something that's a technical marvel if it's not something that people wanna use. There are plenty of examples of tools and companies that have done this and have focused on, you know, if I build this monument to some technical expertise, then people will come use it. Well, not really. People will use it because it helps them solve a problem.

And while you need to be able to figure out a way to balance those two things, I do think there's some empathy for the customer that's necessary there, of what is it that makes you wanna use this thing every day. And yes, all of the upstream technology that it requires is very important. Obviously, if there's super organized, super clean, and super well-defined data, but there's not any data in there because nobody was actually able to get the data from the source into the warehouse, nobody's gonna use that either. But I think it's important to keep in mind that usability matters, and you have to have empathy for your customers as you're building those products.

And ultimately, from the few things that we've brought up as examples, they're fairly small changes that don't really impose any additional technical burden on the people building the pipelines. It's just, as you said, a matter of empathy.

But from your experience of working with other data engineers and with your customers and different data teams, what are some of the common points of contention that you have seen arise between data engineers and data analysts as far as how they view the overall workflow or life cycle of the data that they're trying to gain value from?

So, yeah, the examples were simple ones. I think there are places too, and this happens more, I think, in the internal use cases, where there are more complicated things, where an analyst is frequently trying to solve a problem in a particular way. Maybe they want better mapping software or something like that, because that's a way that execs often ask questions, and they just want a quick way to visualize something on a map rather than having to format it a particular way and load it into Tableau. And so there may be more complicated things there, which I think is, again, kind of a product question for the data engineers: to figure out how hard that is to build and how much value it's really providing, and to understand the use cases behind the request. In terms of these points of conflict or where people can align, one of the things that I think is really important

is for data engineers to understand how data scientists and analysts think.

Again, it's really understanding your customer, but it's not just understanding that I need to be able to deliver dashboards to executives and answer these challenging questions.

It's understanding who your customers are and where they come from. And I think there are a couple of big things that define a lot of analysts and that are critical for thinking about how you build tools for them. One is that they're trying to solve problems quickly, and often trying to answer questions quickly and in kind of rough ways. They'll get a question from an exec that's like, why is this marketing campaign not working?

And they're not trying to necessarily answer that scientifically. They're trying to turn around something so that the business can make a decision. And to an engineer or to a statistician or to somebody who's focused on building robust tools, the way that an analyst works may look sloppy. It may look like something where they're not crossing t's, they're not dotting i's.

They're very quickly trying to do something rough and sort of hacking their way to a solution.

But in a lot of cases, that's the whole point. That is the value of what an analyst does: take complicated questions, distill them down to something pretty quick, ship it off to somebody who's making a decision, and help the business make a decision and move forward. And so in a lot of ways, I think the way to build the tools and to remove friction is to understand not just the problems they're trying to solve, but the mindset behind it, which is: alright, I have a question. How do I answer it? How do I draw conclusions from these observations?

It's not, how do I build a logical model? How do I build the most mathematical thing possible? How do I abstract a complex system?

It often feels kind of rough and sloppy, but to an analyst, that's the job.

And so far, we've been talking about the API between data engineers and data analysts as being the delivered data, probably sitting in a data warehouse somewhere. But on top of that, you've also built the Mode platform, and there are other tools such as Redash or Superset that exist for being able to run these quick analyses, write up some SQL, get a response back, and maybe do some visualization.

I'm wondering, as far as the way that you think about things and also the way that you've seen customers approach it, where the dividing line is between the platform and tooling that data engineers should be providing and the tooling for performing these rapid analyses, in terms of who owns it and who drives the overall vision for those projects?

Yeah. I think that kind of has to be a joint effort. One of the failure modes here, I think, can be this kind of throw it over the wall: you build the tools, I'll consume the tools, where these two teams aren't tightly synced. I think it's important for them to have a similar focus on the same problems. Data engineers shouldn't just be thinking, my objective is to build a tool.

I am a believer that data engineers should sit very close to data scientists or data analysts and should basically have the same KPIs. The objective should be, how do we answer the questions we as a business need to answer? And a data engineer's job is to enable that. Their job isn't to say, alright, I've hit my KPIs because I delivered a tool. They should be trying to serve the same needs as the data scientists. If you end up with the kind of, okay, we'll build a tool, and somebody else will take it and consume it, I think you end up with this disconnect where the analysts aren't able to actually deliver the quality analysis they need to deliver, and there ends up being a lot of friction at that touchpoint between the two, because analysts are looking for a particular thing. They come back to the data engineers and ask for it. Data engineers feel like they're being told what to do. You wanna allow these groups to be a little bit more autonomous, and I think the only way you can do that is to allow them to be invested in the same result. So it's enabling engineers to understand what the value of their product is, not just to the analysts, but how it provides value to the broader set of users around the company. And so in cases like Superset, I think

the if if I'm a data engineer building you know, implementing Superset at a company, I wanna understand not just, okay, great, they want Superset plugged into this this data. It's what questions are you trying to answer to that? You know, at what frequency do you need to do it? Who's the customer of those questions you're trying to answer? All those things help drive some of the decision making behind, you know, how do I where do I how do I get it up and running? Is it something that everybody needs to have access to? Is it something that just analysts need to have access to? Is it something that's to be shared easily? There's a lot of a lot of work there, I think, that is in this gray area between the 2. And I think those those 2 groups need to be, like, open to that gray area rather than sort of perfectly defined. You work on just these things. I work on just these things. Yeah. In some ways, it's changing the definition of done where in 1 world, the definition of done for a data engineer is I've loaded the data into the data warehouse. I have washed my hands of further responsibility. It's now in somebody else's court versus the definition of done is I was able to get the data from this source system all the way to the, you know, chief executive who needs to be able to use the resulting analysis to make a key decision that will drive our business to either success or failure and aligning along those business objectives rather than along the sort of responsibility

objectives, which is 1 of the recurring themes that's been coming up a lot in the conversations I've had on this show, as well as a lot of the themes that have been arising in the division between software engineers and systems administrators

that led to the overall sort of DevOps movement

and just a lot more work in terms of aligning the entire company along the same objectives rather than trying to break them down along organizational

hierarchies.

Yeah. I agree. And, again, this gets back to some small details, but one example actually comes to mind of where this broke down. You had a data eng team that was very much focused on loading data into a warehouse, and an analytics team that was responsible for taking that data, answering questions with it, and passing it off to somebody else.

And there was a column name, and I believe the column name was updated_at, because a lot of Rails apps and things like that have updated_at timestamps that are system generated. That was put into the warehouse because, to an engineer, it made perfect sense. They put it in there. It was clearly named, all of that. An analyst or data scientist interpreted it to mean something different, without an understanding of where they were trying to go with the problem downstream from it. It was a clean handoff to both sides. It's like, updated_at, I know exactly what that means. Both sides thought they knew exactly what it meant, and then they ended up creating these analyses on top of that with the assumption that the column meant something that it didn't mean. And just by having this, okay, you take the ball, now you run with it, and I sort of, like you said, wash my hands of it, it ended up driving a very bad result. That problem could have been fixed very easily. But because the team wasn't focused on the end result of the business objective, the data engineers never actually saw the analysis that was being produced. They never really understood the questions that were being asked. They assumed, okay, you're using it the way you should be using it, instead of stepping back and saying, alright, let's work on the actual problem together to make sure that we're solving this problem in the right way.

This is where another one of the current trends comes into play, of

there being a lot more focus on strong metadata management and data catalogs and being able to track the lineage of the data so that you can identify, oh, this updated_at column wasn't created for the data warehouse. It actually came from the source database. And I understand now a bit more about the overall story of how this came into play, versus having this very black box approach of, all I know is this is how it is now. And then also in terms of metadata management, being able to know how frequently this data is getting refreshed, and when was the last time it was actually updated in the data warehouse, versus whatever this updated_at column is supposedly pointing to that I'm potentially misinterpreting.
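
The distinction drawn above, between when the warehouse copy was last refreshed and what a source-system `updated_at` column actually means, could be recorded as metadata alongside each table. What follows is a minimal sketch in Python that is not tied to any particular catalog product. `TableMetadata`, `is_stale`, and `describe_column` are hypothetical names used only for illustration, and the sample values are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class TableMetadata:
    """Hypothetical record a pipeline could write after each load."""
    table: str
    last_loaded_at: datetime      # when the warehouse copy was last refreshed
    expected_refresh: timedelta   # how often the load is supposed to run
    column_origin: dict = field(default_factory=dict)  # column -> where it came from


def is_stale(meta: TableMetadata, now: datetime) -> bool:
    """True when the table has missed its expected refresh window."""
    return now - meta.last_loaded_at > meta.expected_refresh


def describe_column(meta: TableMetadata, column: str) -> str:
    """Say where a column came from, or admit that nobody recorded it."""
    origin = meta.column_origin.get(column, "unknown origin")
    return f"{meta.table}.{column}: {origin}"


if __name__ == "__main__":
    # Sample values below are invented for the example.
    meta = TableMetadata(
        table="users",
        last_loaded_at=datetime(2019, 9, 1, 6, 0, tzinfo=timezone.utc),
        expected_refresh=timedelta(hours=24),
        column_origin={"updated_at": "copied from the source database, set by the application"},
    )
    now = datetime(2019, 9, 3, 6, 0, tzinfo=timezone.utc)
    print(is_stale(meta, now))
    print(describe_column(meta, "updated_at"))
```

Keeping the load time separate from any column inside the table is the point of the sketch: an analyst can see how fresh the warehouse copy is without having to guess what `updated_at` refers to.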

Mhmm.

Yeah. I have a somewhat negative view of documentation on these things. I think that is a noble goal, but it's really hard to maintain if you have manually created documentation or data dictionaries where people write down, you know, this column means that. It's often really hard to keep that up to date. People are now adding data sources so quickly, and data's evolving so quickly, that it often will lag. And in a lot of cases, a data dictionary that's out of date is more dangerous than no data dictionary at all, because people will go to it. They're like, oh, here's the data dictionary, I assume that this is correct, and it's actually a month behind where it should be. And so people are confidently making a decision off of out-of-date information rather than looking at it with a little bit of skepticism. This is actually, in my mind, a problem that hasn't really been solved. I know there are some vendors out there that are attempting to do this. I'm now blanking on the name, but there's a company that's attempting to do this by using the patterns of how people are actually using data to document it, sort of automatic documentation that happens in the wake of usage rather than people having to manually create it. Whether or not that technology works sort of remains to be seen. But I think that is the right way to think about documentation,

where documentation is really a product of the way that people use something. And really, when I have joined teams or had new folks join teams, the best way that they learn about how different pieces of the data schema work, or how things are connected together, is often from seeing how problems have been answered by other people and mirroring that. It's documentation based on actual usage, and documentation centered around the ways that people define concepts, rather than documentation based on some giant Excel file that is, like, this column is of this type with this data. There are a few folks I've seen that pulled that off, but for the most part, it just becomes a huge time sink to invest in it and something that almost always ends up lacking. So that is a tricky problem. I think that's something that over time folks may figure out. But it's definitely one of those things that for now almost has to be a little bit of, we learn by doing, rather than, we learn by reading a manual to know exactly what these things are. There probably are some places where you could include sort of a common pitfalls type of thing, like, don't use this updated_at timestamp, or, this thing says month to month, but in reality, it's not. Don't trust it. You know, those sort of little gotcha kind of things. But a broader documentation is something that we haven't seen anybody implement terribly well to this point.

I definitely agree that the idea of the static documentation, as far as this is where this comes from, this is how you use it, is

grounds for a lot of potential error cases because of, as you're saying, it becoming stale and out of date and no longer representation of reality. I was actually thinking more along the lines of the work that the folks at Lyft are doing with Amundsen

for being able to use that for data discovery and having a

relevance ranking as far as how often it's being used or the work that WeWork is doing with Marquez, where they're integrating a metadata system into their

ETL pipeline so that it will automatically

update

the information about when a table was last loaded from source

and what are the actual steps that a given piece of data took from the source to the destination where you can look up the table in the data warehouse and then see what are the actual jobs that brought it there, and when were they last run, and were there any errors to be able to get a better context

from the end user perspective as far as what was the path that this data took so that I can have a better understanding about where it came from and how I might actually be able to use it effectively. Yeah. I think yeah. I I I think those are super interesting projects.

And recently, a company called Elementl released an open source tool called Dagster that kind of follows in that same pattern of trying to make pipelines that you're able to parse your way through a little better and diagnose, like, oh, this thing went from step one to step two to step three. And I think that stuff can be super interesting for analysts and data scientists, because one of the big missing pieces in data stacks, and it's maybe solved by these, maybe not, is this: if I get asked a question and I'm investigating some data and something looks awry, like something doesn't quite look right, there's always a little bit in the back of my mind that makes me think, I wonder if this is a data problem. And you're never able to quite escape that. Part of the reason I think that's true is that pipelines are notoriously fragile. You're always going to miss some data. There are always little things, and you have to go through this process of, like, this result doesn't quite make sense. I wonder if I'm double counting something because this one-to-one mapping that I thought was in place actually isn't, that something got written in a way where we thought we had one Salesforce opportunity per customer, but it turns out a second one got created somehow. And you have to go down this rabbit hole of checking your data in various ways. It's not just, was it loaded properly, but all of these other unit test type of things that I don't know necessarily how you quite avoid. There are probably technologies to build to help a lot with that, but there's nothing really in place that gives an analyst or a data scientist, once they look at something, full confidence that, yes, this is something that I understand.
I know exactly what it is, and I need to investigate the business part of the problem that this data is telling me about, rather than, well, should I check and make sure everything is working before I go too far down trying to understand why the thing happened that may or may not have happened. And so any step that moves in that direction, whether it's Amundsen, whether it's the thing that the WeWork folks are building, whether it's the unit test type of tools that dbt has built, all of those things provide a little bit more confidence. I can now check off some things on the list that I was going to have to go check to make sure things are right. The faster you can get through that, the faster I, as an analyst, can focus on solving the business problem rather than banging my head against, like, is it a data problem? What do I not know before I go take this to an executive and say, oh my god, look, our revenue is doing great this quarter? The last thing you want is to come back with, well, we actually had a data problem, and that thing that I told you wasn't true because I failed to investigate this one pipeline that did a thing that I didn't expect.

That's actually a great point to be made as far as the relationship between data engineers and data analysts. One way that data engineers can help make the analyst's job easier is actually making sure that they are integrating those quality checks and unit tests, having an effective way of exposing the output of that, and incorporating the analyst into the process of designing those quality checks to make sure that they are asserting the things that you want them to assert. So in the context of the semantics of distributed systems, there's the concept of exactly-once delivery, at-least-once delivery, or at-most-once delivery, and then understanding how that might contribute to duplication of data or missing data, and what the actual requirements of the analyst are as far as how those semantics should be incorporated into your pipeline, what you should be tuning for, and how you are going to create asserts, whether it's using something like dbt as you mentioned, or the Great Expectations project, or some of the expectations capabilities that are built into things like Delta Lake. And then having some sort of dashboard or integration into the metadata system, or a way of showing the analyst at the time that they're trying to execute a query against a data source: these are the checks that ran, these are any failures that might have happened. That way you can take that back and say, I'm not even going to bother wasting my time on this, because I need to go back to the data engineer and tell them this is what needs to be fixed before I can actually do my work.

I very much agree with that. A lot of time gets sunk into this. There's the common line that data scientists spend 80% of their time cleaning data and all that. I think that number obviously varies a lot. And for folks who are using machine generated data, if you're using event logs and things like that, you still spend that much time cleaning data. Machine generated data is not particularly dirty in the sense of, I have to clean up a bunch of political donation files that are all manually entered. That's dirty data, where you have to figure out a way to take these 50 addresses that are all supposed to be the same and turn them into one. Machine generated data doesn't have that problem, but it has a problem of, can I rely on it? Did it fire these events properly? Did it fire them 10 times when it was supposed to fire them once? Did it miss a bunch? And so anything that can help with that cleaning problem of understanding exactly what I'm looking at and how much my data represents the truth is valuable. As an analyst, you always have this in the back of your mind: this isn't quite truth, this is the best we have. In a lot of cases, I think it's close, but I always have to be a little skeptical that it is truth. And so the pieces there that can help are a big value. Another thing that I think data engineers can do a lot for analysts and data scientists with is that engineers have solved a lot of these problems, or have thought about solutions for solving these problems, in ways that folks who come up through the typical channels into being an analyst or data scientist haven't.

So, one of the interesting things about data scientist roles is that people come from all different backgrounds. You'll be working with someone who's a former banker who is super deep in Excel but is just learning SQL. You'll be working with someone who's a PhD who's written a bunch of R packages but has never actually used a production warehouse. You may be working with someone who's a former front end engineer who got into data visualization. You'll be working with, like, an operations specialist who's been writing Oracle scripts for 10 years. There's no consistent skill set for where these folks come from. And so the idea of even writing something that amounts to a unit test, or the concept of version control and Git, those sorts of concepts aren't things that are necessarily going to come naturally to folks in analytics and data science. And so I think there are places where data engineering can push some of those best practices onto the ways that analysts and data scientists work. Data engineers can do it, vendors can do it. There are lots of different ways that we can standardize some of that stuff.
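As a rough illustration of what "something that amounts to a unit test" on data could look like, here is a minimal sketch that checks the one-opportunity-per-customer assumption mentioned earlier in the conversation. The function name `find_duplicate_keys`, the field names, and the sample rows are all hypothetical; a real team would more likely express this in whatever testing tool it already uses.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the values of `key` that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample data: the assumption under test is that each
# customer has exactly one opportunity.
opportunities = [
    {"opportunity_id": "opp-1", "customer_id": "cust-a"},
    {"opportunity_id": "opp-2", "customer_id": "cust-b"},
    {"opportunity_id": "opp-3", "customer_id": "cust-a"},  # unexpected second opportunity
]

duplicates = find_duplicate_keys(opportunities, "customer_id")
if duplicates:
    print("one-to-one assumption violated for:", duplicates)
```

The same helper would also catch an event that fired several times when it should have fired once, if it were pointed at a hypothetical event identifier instead of a customer identifier.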

There are definitely practices like that that can come from the engineering side of this ecosystem to really encourage folks to learn those things, or to push in that direction and learn some of the pieces there that are valuable for their jobs.

Another element of the reliability and consistency aspect from the analyst perspective when working with the data is actually understanding when the process of getting that data has changed. So you mentioned things like version control. If I push a release to my processing pipeline that changes the way that I am processing some piece of data, and then you run your analysis, and all of a sudden you've got a major drop in a trend line and you have no idea why, you're going to be spending your time tearing your hair out over, maybe I did my query wrong, or maybe something else happened. So it's about being able to synchronize along those changes and making sure that everybody has buy-in as to how data is being interpreted and processed, so that you have confidence that when you run the same query, you're going to get the same results today, tomorrow, and the next day. When that expectation is not met, you start to lose trust in the data that you're working with, and you start to lose trust in the team that is providing you with that data. So making sure that there is alignment and effective communication among teams any time those types of things come into play is, I think, another way that data engineers can provide value to the analysts.
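One way to make that expectation checkable, sketched here under stated assumptions, is to store each metric value together with the pipeline version that produced it, so that a sudden change can be tied to a release rather than guessed at. The `MetricSnapshot` class, the `explain_change` function, and the version labels are hypothetical names for illustration, not part of any tool discussed in this episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSnapshot:
    metric: str
    value: float
    pipeline_version: str  # hypothetical label, e.g. a Git commit or release tag


def explain_change(previous, current, tolerance=0.0):
    """Describe how a metric moved between two runs of the same query."""
    if abs(current.value - previous.value) <= tolerance:
        return "unchanged"
    if current.pipeline_version != previous.pipeline_version:
        return (
            "changed after pipeline release "
            f"{previous.pipeline_version} -> {current.pipeline_version}"
        )
    return "changed with no pipeline release; check source data"


# Hypothetical values: the trend line drops the day after a release.
yesterday = MetricSnapshot("weekly_signups", 1200.0, "v41")
today = MetricSnapshot("weekly_signups", 800.0, "v42")
print(explain_change(yesterday, today))
```

This does not say why the number moved, only where to look first: at the release notes for the pipeline, or at the source data.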

Yeah. And that trust, I also think, extends further. As a data engineer, again, if your goal is to build an organization that has the mentality that the products you're building matter, you also need to think about that. So as a concrete example, say you have a revenue dashboard. We've worked with a lot of companies, and figuring out how much you make seems like it'd be the simplest thing. It's one of the most important numbers any company has. It's always impossible. Nobody does this well. There's always a ton of weird edge cases, with data coming from tons of different places. It's coming from a billing system. It's coming from Salesforce, which is all this manually entered stuff. So it's kind of a nightmare of a process just to figure out how much money we made. But say you have a revenue dashboard, and it says today that you made $1,000,000 last quarter, and then tomorrow it says you made $1,500,000 last quarter. And as an analyst, that's your nightmare scenario, because now an executive saw this, they don't know which number's right, and they're mad at you because why in the world did this thing change?

They just told the board it was $1,000,000, and now it's saying a million and a half. And, like, is it going to go to $750,000 tomorrow? Nobody's going to know what happened. And so you end up having to dig through so many different pieces of that. Did you write a bad query? Did a sales rep go into Salesforce and backdate a deal that actually signed? Was there a data entry problem in Salesforce where somebody put in something wrong? Did you double count something that you didn't mean to double count? Was there a data pipeline problem where data got updated in a bad way? All of those things end up becoming just this, alright, how do I figure out how it happened? And often you don't have any real record of what the system was before. You just know it used to say a million, and now it says one and a half, and you have to figure this out. Those are the types of problems that are the headaches for data analysts, and the ones that you end up finding yourself in all the time. The more systems you have in place that let you say, oh, yeah, it's one and a half because we backdated a deal, or because this other thing happened, the faster you can answer that, and the easier your job is. But more importantly, the more trust the rest of the organization will have in your job, because you're not spending all this time trying to explain a number and not sure which one you actually want to stand behind.

From the perspective of somebody who is working so closely with data analysts and with companies who have data engineering teams, as well as consuming some of these managed services for data platforms, I'm curious what you have seen in terms of industry trends that you're most excited by from your perspective as an analyst, and some of the things that you are currently concerned by that you would like to see addressed.

Mhmm.

So this isn't really a technology, but I think it's one of the places where the industry can go. Generally, the industry has made very big strides in enabling day-to-day data-driven decision making. We've done a lot about how we get data in front of people around the business and how we get it to them quickly. When I am making a decision as a sales rep about who to call today, or as a marketing manager about which campaigns to focus on, how do I make that decision? I think we've made a lot of progress on that front. One of the places where I think we, as an industry, can now start to turn and focus more is this: businesses aren't driven by these daily optimizations, and businesses don't win because they made the right daily optimizations. It certainly helps, but the big bets are often the things that determine the winners. Jeff Bezos has this line that in business, you want to swing for the fences, because unlike in baseball, where the best you can do on a single swing is score four runs, in business, if you take the right swing, you can score a thousand runs. There's such a long tail of positive outcomes from decisions that it's worth it to make these big bets, because the outcomes of those can be way better than any kind of small optimization.

And I think that we haven't really had data technologies focused on helping people figure out how to make those big bets. There's a lot of exploratory work that analysts do to really try to understand what the big bet is that we should make. And that's one of the places that I'm excited for folks to be able to go next. It's not just, alright, we are data driven because we have dashboards, we are data driven because everybody's able to look at a chart every day. How do we become data driven about the big bets? That, I think, is really about how we enable data analysts and data scientists to answer questions more quickly, to be able to explore which big bets work out. Ultimately, the way that I think you win on making big bets is by being able to make more of them and making them smarter. And the way you can do that is to more quickly research these problems and understand what might happen if you make these changes. Those aren't things that a dashboard will tell you. Those are things that in-depth analysis will tell you. And so, as an industry, I think one of the places we can go next is not just enabling, again, how do we optimize small things, but how do we uncover the really big opportunities that are still very much a boardroom type of conversation these days.

And are there any other aspects of the ways that data engineers and data analysts can work together effectively that we didn't discuss yet that you'd like to cover before we close out the show?

I think it mostly comes back to knowing your customer and knowing the problems they're trying to solve. It's really getting into knowing exactly what it is that they're trying to do and how they use the products that you build. Again, I think learning from product managers and designers is a super valuable thing, because that's the problem they solve: learning from their customers. And I think that data engineers can, in a lot of ways, do the same thing.

One other thing, and this isn't really necessarily about data engineers, but a place where organizations can think a little bit about data engineering today, is don't hire folks too early. Tristan Handy, I think it was Tristan, wrote a blog post about this that was basically focused on the idea that a lot of organizations will think they need a data engineer before they do. With all these tools out there now, with the ETL tools, with point-and-click warehouses that are super easy to set up, and with how well these tools scale, Stitch and Fivetran and those other tools can scale to pipelines that are plenty big for most companies. Snowflake and BigQuery and Redshift can scale to a database that's plenty big for most folks. So your data engineering problems are often not going to be that interesting or complex.

They're problems that it's important to have someone own, but that person often doesn't need to be a dedicated data engineer. And I think there is a way in which companies can hire a data engineer too quickly, because it's a role they think they need, but the data engineer ends up basically being an administrator of a bunch of third party tools. That's not a role that a data engineer wants. A data engineer wants to solve hard problems. They want to be able to work on interesting stuff. They don't want to be someone who's checking to make sure that Stitch is running every day, or checking the dashboard for your Airflow pipelines: yep, Airflow ran again, looks good. That's not that interesting of a problem. So it's being honest about what the role is and where you need people to come in before you actually hire someone who thinks they're coming in to figure out how to scale a Spark cluster to something huge, when in reality they're just checking your Snowflake credits every day and being like, yeah, okay, we're still using it at the same rate that we need to. So I think that's kind of a big shift: you can get pretty far with the out-of-the-box stuff now.

So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

So I think one is the piece that we've talked about: how do you monitor pipelines? And not monitoring in a strict DevOps sense, but monitoring in the sense of knowing, when I have a question and I see something out of place, that I can very quickly pin down whether it was because the data changed, whether it was because some assumption that I made got invalidated, or whether it was because a data pipeline didn't work or a pipeline ran in an ordering that was unexpected.

All those sorts of things are super valuable and save analysts tons of time from having to dig through the weeds of these problems. There's another place where I think we're starting to see some movement, but we still don't have a real solution for, which is a centralized modeling layer. When you think about how data gets used around an organization, it's not consumed by one application. As a business, I have a bunch of data. Typically, folks now can centralize that data into data lakes or warehouses, putting it all in S3 and putting Athena on top, or Snowflake, or whatever. But then you have to consume that, and say you want to model what a customer is. That's sort of a traditional BI type of problem, but most BI models are models that only operate within the BI application.

And because data now is spread so much through an organization, the model of what a customer is needs to be centralized. It needs to be available to engineers who are using APIs to pull data out programmatically to define something in the product. It needs to be available to data scientists who are building models on top of that to forecast revenue or build in-product recommendation systems. It needs to be available to an executive who's looking at a BI tool to see how many new customers we have every day. All of these different applications require this kind of centralized definition of what a customer is. A tool like dbt is moving in the right direction, but there's still not a great way to unify concepts within a data warehouse like that in a way that can be consistent. So you end up, as someone consuming that data, having to rebuild a lot of these concepts in different ways, which ends up creating all sorts of problems where what is a customer in one place isn't quite what's a customer in another. I don't think we've quite figured out that part of the layer. We have the warehouse layer. There are very robust applications that can sit on top of the warehouse, but they all feed into the warehouse through different channels and through different business definitions of what this data means. And without that centralized layer, you're always going to have some confusion over these different definitions.
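To make the idea of one shared definition concrete, here is a small sketch in which a dashboard count and a forecasting input both call the same `is_customer` function instead of each rebuilding the rule. Everything in it is an assumption for illustration: the `Account` fields, the rule itself (a non-internal account that has paid on or before a given date), and the two consumer functions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Account:
    account_id: str
    first_paid_on: Optional[date]  # None if the account has never paid
    is_internal: bool = False


def is_customer(account: Account, as_of: date) -> bool:
    """Single shared rule: a non-internal account that has paid by `as_of`."""
    return (
        not account.is_internal
        and account.first_paid_on is not None
        and account.first_paid_on <= as_of
    )


def dashboard_customer_count(accounts, as_of):
    """What an executive-facing chart would show."""
    return sum(is_customer(a, as_of) for a in accounts)


def forecast_training_ids(accounts, as_of):
    """What a forecasting model would train on."""
    return [a.account_id for a in accounts if is_customer(a, as_of)]


# Hypothetical accounts.
accounts = [
    Account("a1", date(2019, 1, 15)),
    Account("a2", None),
    Account("a3", date(2019, 2, 1), is_internal=True),
    Account("a4", date(2019, 6, 1)),
]
as_of = date(2019, 3, 31)
print(dashboard_customer_count(accounts, as_of), forecast_training_ids(accounts, as_of))
```

Because both consumers go through the same function, changing the rule in one place changes it everywhere, which is the property the conversation says is missing when each application carries its own definition.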

Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Mode and your experiences working with data engineers and trying to help bridge the divide between the engineers and the analysts. It's definitely a very useful conversation and something that everybody on data teams should be thinking about to make sure that they're providing good value to the business. So I appreciate your time, and I hope you enjoy the rest of your day.

Thanks, Tobias. Thanks for having me.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
