Building Tools And Platforms For Data Analytics - Episode 95

Summary

Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has its own set of requirements for the way that they access and interact with those platforms, depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics, and in this episode he explains the considerations and requirements that data analysts have for their tools. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.

Do you want to try out some of the tools and applications that you heard about on the Data Engineering Podcast? Do you have some ETL jobs that need somewhere to run? Check out Linode at linode.com/dataengineeringpodcast or use the code dataengineering2019 and get a $20 credit (that’s 4 months free!) to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing some of the main features that you are looking for in the tools that you use?
  • What are some of the common shortcomings that you have found in out-of-the-box tools that organizations use to build their data stack?
  • What should data engineers be considering as they design and implement the foundational data platforms that higher order systems are built on, which are ultimately used by analysts and data scientists?
    • In terms of mindset, what are the ways that data engineers and analysts can align and where are the points of conflict?
  • In terms of team and organizational structure, what have you found to be useful patterns for reducing friction in the product lifecycle for data tools (internal or external)?
  • What are some anti-patterns that data engineers can guard against as they are designing their pipelines?
  • In your experience as an analyst, what have been the characteristics of the most seamless projects that you have been involved with?
  • How much understanding of analytics is necessary for data engineers to be successful in their projects and careers?
    • Conversely, how much understanding of data management should analysts have?
  • What are the industry trends that you are most excited by as an analyst?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Tobias Macey
0:00:10
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts. And just full disclosure, Mode is a past sponsor of this show. So Benn, could you start by introducing yourself?
Benn Stancil
0:01:52
I am one of the founders and chief analyst of Mode. Mode builds a product for data analysts and data scientists. So I'm responsible for both our internal analytics here at Mode, as well as working a lot with our customers to help us better understand the needs that they have in the product, and how we can better serve them as analysts and data scientists. Prior to Mode, I worked on the analytics team at Yammer, which was a startup that was purchased by Microsoft in 2012. And before that, my background is in economics and math. And so I actually worked for a think tank in DC for a few years doing economic research before landing in San Francisco and in the tech world.
Tobias Macey
0:02:29
And do you remember how you first got involved in the area of data management?
Benn Stancil
0:02:33
Yeah, so it was actually as a customer, really. I was working as an analyst at Yammer; my first job in tech was at Yammer. And I was really a customer of our data engineering team. So we used the tools that they built, as well as the data that they provided. Yammer was kind of one of the early leaders in the philosophy that engineers shouldn't build ETL pipelines, which is now something that's become a little bit more popular. There's a blog post from the folks over at Stitch Fix that talked about this very explicitly. But Yammer had the same philosophy. And so while I was there, we were responsible for building our own pipelines and for sort of dipping our toes into the data engineering and data management world. And so that was kind of my first taste of it. Then after leaving Yammer and starting Mode, which, as I've mentioned, is a product for data analysts and data scientists, I actually ended up taking two further steps into data management. First, I'm responsible for our own data infrastructure here at Mode. And so my role is to think about not just the analytics that we do, but how we actually get the data into the place that we want to get it. But also, in a lot of ways, Mode is solving the same problem, or providing the same service, that the Yammer data engineering team was providing me when I was an analyst, which is we are now building tools for other data scientists. And so with the product that we provide, we very much have to think about how it fits into the data management ecosystem. How does it solve the problems that not just analysts and data scientists have, but the problems that data engineers have when they're trying to serve those customers?
Tobias Macey
0:03:58
And so you've mentioned that in your work at Mode, you're actually responsible for the data management piece, and that you're working closely with other data engineering teams and other analysts to make sure that they are successful in their projects. And I'm wondering if you can start by describing some of the main features that you are generally looking for in the tools that you use, and some of the top level concerns that you're factoring into your work at Mode and the tool that you're providing to other people?
Benn Stancil
0:04:27
Yeah, so internally at Mode, one of the things that we really care about is that we want to make it something that is easy to use for the analysts and data scientists who are actually consuming that data. So again, to come back to the point from that Stitch Fix blog post, we really believe that the data scientists here at Mode should be responsible for as much of the data management as possible. There are a lot of great tools out there now, ETL tools or warehouse tools or pipeline tools, that analysts can manage pretty well. You don't need someone to be a dedicated, capital-E engineer to really build out the initial phases of the pipeline. And so for us, when we evaluate those tools internally, we want to make sure that they are things that we can set up pretty easily, and that as customers of those tools who aren't necessarily fully fledged engineers ourselves, we still know how they work and can still make sure they're up and running and performing the way we want. I think the analogy we often use with this is that it's like buying a car: you don't necessarily need to know the ins and outs of how the car works, but you need to know that it's reliable. And if you learn not to trust the car, that it's not actually going to drive, you don't want to learn how to fix the car, you just want to buy a different car that actually works. And so when we're looking for tools ourselves, we tend to focus a lot on that: what's the experience like for the folks who are using it? Can we rely on it? And is it something that we need to have a dedicated person to run, or is it something that we can kind of run in the background and the analysts,
0:06:01
the data scientists can get it to work the way they like to work.
0:06:05
The other thing I think that we really look for is usability. I think this is a place that the folks who are building ETL tools and data pipeline tools often don't think about as much as perhaps they could, which is that the surface area of those tools isn't the application itself or the web interface. I really think of the surface area of those tools as the data itself: if I'm using an ETL tool, the way that I interact with that tool day in and day out is by actually interacting with the data that the tool is providing, not by logging into the web interface and checking the status of the pipelines and things like that. And so in those cases, little things matter. It ends up being column names that matter: are there weird capitalization schemes, or are there periods in column names? Those little things that make it more frustrating to work with day in and day out end up being things that really drive our experience with those tools. Working with customers, Mode's customers range from small startups to much larger enterprises. I think small startups often look like us. For the large enterprises, the place that we really try to focus is making sure that the tools that we recommend are modular. Data stacks end up becoming very complicated; they end up having to serve a lot of different folks across a lot of different departments, pulling data from tons of different sources. We try to steer people away from focusing on one tool to rule them all. Having one pipeline, one warehouse, one analytics tool, all of these things serving every need, sounds nice, but it's often very difficult to actually create that.
And we'd rather people be able to modularize different parts of their stack so that if something new comes along that they want to use, they can easily swap something in and out without having to re-architect the entire pipeline.
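The column-name irritations described in this answer (odd capitalization schemes, periods in column names) are the kind of thing a pipeline could smooth over at load time. What follows is only a minimal Python sketch of that idea, not something described in the episode: the function names `normalize_column_name` and `normalize_columns` are hypothetical, and lowercase snake_case is assumed as the target convention.

```python
import re


def normalize_column_name(name: str) -> str:
    """Return a lowercase snake_case version of a column name."""
    # Insert an underscore at camelCase boundaries, e.g. "userId" -> "user_Id".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse periods, spaces, and any other non-alphanumeric run into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def normalize_columns(names):
    """Map original names to normalized ones, failing loudly on collisions."""
    mapping = {}
    seen = {}
    for original in names:
        clean = normalize_column_name(original)
        if clean in seen:
            # Two source columns would become indistinguishable; a human should decide.
            raise ValueError(
                f"{original!r} and {seen[clean]!r} both normalize to {clean!r}"
            )
        seen[clean] = original
        mapping[original] = clean
    return mapping
```

Raising on a collision rather than silently renaming is a deliberate choice in this sketch: a quiet rename would recreate the ambiguity the normalization is meant to remove.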
Tobias Macey
0:07:53
A couple of things stood out to me as you were talking there, one being that you're talking a bit about some of the hosted, managed platform tools, where anybody can just click a button and have something running in a few minutes. And then on the other side of the equation, particularly if you have a very engineering heavy organization or a larger organization, you're probably going to have dedicated data engineers building things, either by cobbling together different open source components or building something from scratch. And I'm curious what you have found to be the juxtaposition, as far as the level of polish and consideration for the user experience of the analyst at the other side of the pipeline, as you have worked with different teams and different tools that fall on either side of that dividing line of the managed service versus the build-your-own platform.
Benn Stancil
0:08:46
So the managed service, it depends on the tool. I think some tools do a really great job of this, some tools less so. I think that's probably true for any product in a similar space: some products do a really good job thinking about the experience for customers, and others are more focused on technical reliability or on other aspects of that product experience. For an example of a great tool that does some things really well but also has one of these pain points, take Snowflake. So we're actually Snowflake customers. The database is a very powerful tool for us, and we recommend it to anybody. But, I believe in the tradition of the Oracle folks from which it came, all of their column names are automatically capitalized. And so it's just one of those small irritations where it seems like when they developed it, it wasn't necessarily a consideration of how analysts are going to be interacting with this day in and day out, when all of your queries are constantly yelling at you, because you have to
0:09:45
capitalize everything. So
0:09:46
little things like that, I think, are places where companies could think a little bit more about the ways that people use it. From the perspective of internal tools and the folks who are building these from scratch, I actually think that in a lot of cases those tools tend to be better in the ways that they think about user experience, because the people who are building them are sitting directly next to the customers that they're providing it for. If you're an engineer and your customer is the desk over from you, there's a really fast back and forth on how you actually use this: you see somebody use it every day, you hear them complain about it, they like
0:10:19
get the benefit that vendors don't get of
0:10:22
literally working with their customer day in and day out, and their customer being able to walk over to their desk and say, hey, this is a thing, can you change it, and stuff like that. So while the internal tools often aren't as technically robust, and often aren't as reliable in a lot of other respects, and aren't nearly as powerful and flexible, the small things often work a little bit better, because they were custom built for exactly that audience.
Tobias Macey
0:10:42
For the things that you're talking about that contribute to just an overall better experience for the analyst, things like managing the capitalization of column names or preventing the insertion of dots in the column name that will break a SQL query, what are some of the guardrails that data engineers should be thinking about in terms of how their tools are able to generate specific types of output, or the overall ways that they can be thinking of the end goal as they're building these different tools and pipelines, that would make it easier for analysts and other people within the business to actually make effective use of the end result of their work?
Benn Stancil
0:11:23
Yeah, I think it's being a product manager in a
0:11:25
lot of ways. It's doing the research and knowing your customer,
0:11:29
that those little things
0:11:30
aren't things that are necessarily going to be obvious. And it's very difficult to build a framework to show you exactly how those things will work. I think the best framework is the framework that product managers or designers build: how do you understand the needs of your customer? How do you engage with your customers and learn from them?
0:11:48
You know, even as an analyst
0:11:50
and as someone who lives in these tools day in and day out, and is the customer of those in a lot of respects, I don't know that I could sit down and make a list of all the little things that I like or don't like. It's something that you very much realize as you're working on it, in the same way that, I'm sure, for engineering products or tools that are built for engineers, they have opinions about how those should be built, but don't necessarily have an ability to just write down the framework for understanding all these things. And a lot of it is wanting to build something that your customers like, and then taking the time to listen to them and understand what these pain points are that they're having and where they come from: what are you trying to accomplish when you do it? I think a lot of it is the fundamental aspects of product management and
0:12:32
user research to really get at the core of what those problems are.
Tobias Macey
0:12:35
From the perspective of team dynamics or an organizational approach to managing these data projects, how much responsibility do you feel lies with the data engineering team to ensure that they're considering these end user requirements, as far as the capitalization of schema names or things like that, when they're also concerned with making sure that they are able to obtain source data, and that their pipelines are reliable and efficient? They have, you know, maybe n plus one different considerations for the overall management of the data itself before it gets delivered. And how much of it is a matter of incorporating analysts in that overall workflow? Basically, I'm just trying to get the breakdown of where you see the level of understanding and responsibility for identifying and incorporating these UX points in the overall lifecycle of building these products.
Benn Stancil
0:13:32
So generally, I would say the tool needs to work on the other aspects, and I recognize that for a lot of the data engineers who are building these products, either internally or as vendors, there are lots of very complicated problems they have to work on. And I think most analysts would recognize that as well. I think the responsibility doesn't necessarily lie with the engineers building these tools to go out and determine all this stuff on their own, without the help of their customers to tell them.
0:14:00
I think that that the thing that is the responsibility of the tool builders is more just having the empathy of
0:14:07
the customer that's using it. It's less about, you know, I need
0:14:09
to go figure out what these little usability issues or other things like that are, the things that are going to get in the way of my customer using this every day. They should be more willing to listen when somebody has that feedback, and recognize that those sorts of things are also things that will affect how somebody uses the tool that they build. Again, as with any product, I don't believe you can build something that's a technical marvel if it's not something that people want to use. There are plenty of examples of tools and companies that have done this and focused on, you know, if I build this monument to some technical expertise, people will come use it. Well, not really. People will use it because it helps them solve a problem. And while you need to be able to figure out a way to balance those two things, I do think there's some empathy for the customer that's necessary there: what is it that makes you want to use this thing every day? Yes, all of the upstream technology that it requires is very important. Obviously, if there's super organized and super clean and super well defined data, but there's not any data in there, because nobody was actually able to get the data from the source into the warehouse or whatever, nobody's going to use that either. But I think it's important to keep in mind that usability matters, and you have to have empathy for your customers as you're building this product.
Tobias Macey
0:15:28
And ultimately, from the few things that we've brought up as examples, they're fairly small changes that don't really include any additional technical burden on the people building the pipelines, it's just as you said, a matter of empathy. But from your experience of working with other data engineers, and with your customers and different data teams, what are some of the common points of contention that you have seen arise between data engineers and data analysts as far as how they view the overall workflow or lifecycle of the data that they're trying to gain value from?
Benn Stancil
0:16:00
So yeah, the examples were simple ones. I think there are places, and this happens more in the internal use cases, where there are more complicated things, where an analyst is frequently trying to solve a problem in a particular way. Maybe they want mapping software or something like that, because that's a way execs often ask questions, and they just want a quick way to be able to visualize something on a map, rather than having to format it, take it away, load it into Tableau, and do that. And so there may be more complicated things there, which I think is again kind of a product question for the data engineers: to figure out how hard that is to build and how much value it's really providing, and to understand the use cases behind the request. In terms of these points of conflict, or where people can align, one of the things that's really important is for data engineers to understand how data scientists and analysts think. Again, it's really understanding your customer. But it's not just understanding that I need to be able to deliver dashboards to executives and answer these challenging questions. It's understanding who your customers are and where they come from. And I think there are a couple of big things that define a lot of analysts and that are critical for thinking about how you build tools for them. One is that they're trying to solve problems quickly, and often trying to answer questions quickly and in kind of rough ways. They'll get a question from an exec that's like, why is this marketing campaign not working? 
And they're not trying to answer that scientifically; they're trying to turn around something so that the business can make a decision. To an engineer, or to a statistician, or to somebody who's focused on building robust tools, the way that an analyst works may look sloppy. It may look like something where they're not crossing t's or dotting i's; they're very quickly trying to do something rough and hacking their way to a solution. But in a lot of cases, that's the whole point. That is the value of what an analyst does: take complicated questions, distill them down to something pretty quick, ship it off to somebody who's making a decision, and help the business make a decision and move forward. And so in a lot of ways, I think the ways that the tools get built, and the ways to remove friction, are to understand not just the problems they're trying to solve, but the mindset behind it, which is: all right, I have a question, how do I answer it? How do I draw conclusions from these observations? It's not, how do I build a logical model? How do I build the most mathematical thing possible? How do I abstract a complex system? It often feels kind of rough and sloppy, but to an analyst, that's,
0:18:37
that's the job.
Tobias Macey
0:18:39
And so far, we've been talking about the API between data engineers and data analysts as being the delivered data, probably sitting in a data warehouse somewhere. But on top of that, you've also built the Mode platform, and there are other tools, such as Redash or Superset, that exist for being able to run these quick analyses: write up some SQL, get a response back, maybe do some visualization. I'm wondering, as far as the way that you think about things and also the way that you've seen customers approach it, where the dividing line is in terms of the platform and tooling that data engineers should be providing versus where the tooling for performing these rapid analyses lives, in terms of who owns it and who drives the overall vision for those projects.
Benn Stancil
0:19:27
Yeah. So I think that kind of has to be a joint effort. One of the failure modes here, I think, can be to just throw it over the wall: you build the tools, I'll consume the tools, where these two teams aren't tightly synced. I think it's important for them to have a similar focus on the same problems. A data engineer shouldn't just be thinking, my objective is to build a tool. I'm a believer that data engineers should sit very closely to data scientists or data analysts, and basically have the same KPIs. The objective should be, how do we answer the questions we as a business need to answer? And a data engineer's job is to enable that. Their job isn't to say, all right, I've hit my KPIs because I delivered a tool. They should be trying to serve the same needs as the data scientists. And so if you end up with the kind of, okay, we'll build a tool and somebody else will take it and consume it, I think you end up with this disconnect, where the analysts aren't able to actually deliver the quality analysis they need to deliver. And there ends up being a lot of friction at that touchpoint between the two, because analysts are looking for a particular thing, they come back to the data engineers to ask for it, and the data engineers feel like they're being told what to do. You want to allow these groups to be a little bit more autonomous, and I think the only way you can do that is to allow them to be invested in the same result. So it's enabling engineers to understand what the value of that product is, not just to the analysts, but how it provides value to the broader set of users around the company. And so in cases like Superset, say, if I'm a data engineer implementing Superset at a company, I want to understand not just, okay, great, 
they want Superset plugged into these databases. It's, what questions are you trying to answer with that? At what frequency do you need to do it? Who's the customer of those questions you're trying to answer? All those things help drive some of the decision making behind how I get it up and running. Is it something that everybody needs to have access to? Is it something that just analysts need to have access to? Is it something that needs to be shared easily? There's a lot of work there, I think, that is in this gray area between the two, and I think those two groups need to be open to that gray area, rather than perfectly defined: you work on just these things, I work on just these things.
Tobias Macey
0:21:39
Yeah, in some ways, it's changing the definition of done, where in one world, the definition of done for a data engineer is, I've loaded the data into the data warehouse, I have washed my hands of further responsibility, it's now in somebody else's court, versus the definition of done being, I was able to get the data from this source system all the way to the chief executive who needs to use the resulting analysis to make a key decision that will drive our business to either success or failure. It's aligning along those business objectives rather than along the responsibility objectives, which is one of the recurring themes that's been coming up a lot in the conversations I've had on this show, as well as a lot of the themes that arose in the division between software engineers and systems administrators that led to the overall DevOps movement, and just a lot more work in terms of aligning the entire company along the same objectives, rather than trying to break them down along organizational hierarchies.
Benn Stancil
0:22:36
Yeah, I agree. And, you know, this is again getting back to some small details, but one example actually comes to mind where this broke down. There was a data engineering team that was very much focused on loading data into the warehouse, and an analytics team that was responsible for taking that data, answering questions with it, and passing it off to somebody else. And there was a column name, and I believe the column was updated_at, because a lot of Rails apps and things like that have updated_at timestamps that are system generated. That was put into the warehouse because that was what it was. To an engineer, it made perfect sense: they put it in there, it's clearly named, all of that. To an analyst or data scientist, it was interpreted to mean something different, without anyone understanding where they were trying to go with the problem downstream from it. It was a clean handoff, where to both sides it was like, updated_at, I know exactly what that means. Both sides thought they knew exactly what it meant. And then they ended up creating these analyses on top of that, with the assumption that the column meant something that it didn't mean. And just by having this, okay, you take the ball and run with it, and I'm, like you said, washing my hands of it, it ended up driving a very bad result. And so that probably could have been fixed very easily. 
But because the team wasn't focused on the end result of the business objective, the data engineers never actually saw the analysis that was being produced, and they never really understood the questions that were being asked. They assumed, okay, you're using it the way you should be using it, instead of stepping back and saying, all right, let's work on the actual problem together to make sure that we're solving this problem in the right way.
Tobias Macey
0:24:04
This is where another one of the current trends comes into play: there's a lot more focus on strong metadata management and data catalogs, and on being able to track the lineage of the data, so that you can identify, oh, this updated_at column wasn't created for the data warehouse, it actually came from the source database. Now I understand a bit more about the overall story of how this came into play, versus having a black-box approach where all I know is, this is how it is now. And then also, in terms of metadata management, being able to know how frequently this data is getting refreshed, and when it was last actually updated in the data warehouse, versus whatever this updated_at column is supposedly pointing to that I'm now potentially misinterpreting.
Benn Stancil
0:24:48
Yeah, I have a somewhat negative view of documentation on these things. I think it's a noble goal, but it's really hard to maintain. If you have manually created documentation, data dictionaries where people write down "this column means that" and so on, it's often really hard to keep up to date. People are adding data sources so quickly, and data is evolving so quickly, that the documentation will often lag. And in a lot of cases, a data dictionary that's out of date is more dangerous than no data dictionary at all, because people will go to it and say, oh, here's the data dictionary, I assume this is correct, when it's actually a month behind where it should be. So people are confidently making decisions off of out-of-date information rather than looking at it with a little bit of skepticism. This is, in my mind, a problem that hasn't really been solved. I know there are some vendors out there attempting to do this. I'm blanking on the name, but there's a company attempting to do it by using the patterns of how people actually use data to document it, a sort of automatic documentation that happens in the wake of usage rather than people having to create it manually. Whether or not that technology works remains to be seen, but I think that is the right way to think about documentation: documentation is really a product of the way that people use something. When I have joined teams, or had new folks join teams, the best way they learn how different pieces of the data schema work, or how things are connected together, is often from seeing how problems have been answered by other people and mirroring that. It's documentation based on actual usage and centered around the ways that people define concepts, rather than documentation based on some giant Excel file that says this column is of this type with this data. There are a few folks I've seen pull that off, but for the most part it just becomes a huge time sink, and something that almost always ends up lagging. So that is a tricky problem. I think it's something that folks may figure out over time, but for now it almost has to be a bit of "we learn by doing" rather than "we learn by reading a manual." There probably are some places where you could include a "common pitfalls" type of thing, like "don't use this updated_at timestamp," or "this thing says month-to-month, but in reality it's not, don't trust it," those little gotchas. But broader documentation is something that we haven't seen anybody implement terribly well to this point.
Tobias Macey
0:27:22
I definitely agree that static documentation, as in "this is where this comes from, this is how you use it," is grounds for a lot of potential errors because, as you were saying, it becomes stale and out of date and no longer a representation of reality. I was actually thinking more along the lines of the work that the folks at Lyft are doing with Amundsen, using it for data discovery with a relevance ranking based on how often data is being used, or the work that WeWork is doing with Marquez, where they're integrating a metadata system into their ETL pipeline so that it automatically updates the information about when a table was last loaded from source and what steps a given piece of data took from the source to the destination. You could look up the table in the data warehouse and then see which jobs brought it there, when they were last run, and whether there were any errors. That gives the end user better context about the path the data took, so that they can have a better understanding of where it came from and how they might actually be able to use it effectively.
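As an illustration of the kind of pipeline-generated metadata described above, here is a minimal Python sketch. It is not the Amundsen or Marquez API; the JobRun record, the table names, and the helper functions are all hypothetical, and a real system would persist this information rather than hold it in a list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class JobRun:
    """One execution of a pipeline job (hypothetical record shape)."""
    job_name: str
    output_table: str
    finished_at: datetime
    succeeded: bool
    error: Optional[str] = None


def latest_run_for_table(runs: List[JobRun], table: str) -> Optional[JobRun]:
    """Return the most recent run that wrote to `table`, or None if there is none."""
    candidates = [run for run in runs if run.output_table == table]
    return max(candidates, key=lambda run: run.finished_at, default=None)


def describe_table_freshness(runs: List[JobRun], table: str) -> str:
    """Summarize which job last loaded a table, when, and whether it failed."""
    run = latest_run_for_table(runs, table)
    if run is None:
        return f"{table}: no recorded loads"
    status = "ok" if run.succeeded else f"FAILED ({run.error})"
    return (
        f"{table}: last loaded by {run.job_name} "
        f"at {run.finished_at.isoformat()} [{status}]"
    )


# Made-up run history for two hypothetical tables.
runs = [
    JobRun("load_orders", "warehouse.orders", datetime(2019, 8, 1, 2, 0), True),
    JobRun("load_orders", "warehouse.orders", datetime(2019, 8, 2, 2, 0), False, "source timeout"),
    JobRun("load_users", "warehouse.users", datetime(2019, 8, 2, 3, 0), True),
]

print(describe_table_freshness(runs, "warehouse.orders"))
```

An analyst looking at warehouse.orders would see that the most recent load failed, which is the context they need before trusting a number built on that table.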
Benn Stancil
0:28:29
Yeah, I think those are super interesting projects. And recently a company called Elementl released an open source tool called Dagster that follows that same pattern of trying to make pipelines that you're able to parse your way through a little better and diagnose, oh, this thing went from step one to step two to step three. I think that stuff can be super interesting for analysts and data scientists, because one of the big missing pieces in data stacks, and it may be solved by these, maybe not, is this: if I get asked a question and I start investigating some data and something looks awry, there's always something in the back of my mind that makes me think, I wonder if this is a data problem. And you're never quite able to escape that. Part of the reason that's true is that pipelines are notoriously fragile. You're always going to miss some data. You have to go through this process of, this result doesn't quite make sense, I wonder if I'm double counting something because the one-to-one mapping I thought was in place actually isn't. Something went awry where we thought we had one Salesforce opportunity per customer, but it turns out a second one got created somehow. And you have to go down this rabbit hole of checking your data in various ways. It's not just "was it loaded properly," but all of these other unit-test types of things, and I don't necessarily know how you avoid it. There are probably technologies to build that would help a lot with that.
But there's nothing really in place that gives an analyst or data scientist, once they look at something, the confidence that yes, this is something I understand, I know exactly what it is, and I need to investigate the business part of the problem that this data is telling me about, rather than, well, should I check and make sure everything is working before I go too far down the path of trying to understand why the thing happened that may or may not have happened. So any step that moves in that direction, whether it's Amundsen, whether it's the thing the WeWork folks are building, whether it's the unit-test type of tools that dbt has built, all those things provide a little bit more confidence. I can now check off some things on the list that I was going to have to go check to make sure things are right. The faster you can get through that, the faster I, as an analyst, can focus on solving the business problem rather than banging my head against, is it a data problem? What do I not know before I take this to an exec and say, oh my God, look, our revenue is doing great this quarter? The last thing you want is to come back and say, well, we actually had a data problem, and that thing I told you wasn't true, because I failed to investigate this one pipeline that did a thing I didn't expect.
Tobias Macey
0:31:12
That's actually a great point to be made as far as the relationship between data engineers and data analysts. One way that data engineers can help make the analyst's job easier is by making sure that they are integrating those quality checks and unit tests, having an effective way of exposing the output of them, and incorporating the analyst into the process of designing those quality checks to make sure that they assert the things you want them to assert. In the context of the semantics of distributed systems, there's the concept of exactly-once, at-least-once, or at-most-once delivery, and understanding how that might contribute to duplication of data or missing data. What are the actual requirements of the analyst as far as how those semantics should be incorporated into your pipeline? What should you be shooting for? And how are you going to create asserts, whether it's using something like dbt, as you mentioned, or the Great Expectations project, or some of the expectations capabilities that are built into things like Delta Lake? And then having some sort of dashboard, or integration into the metadata system, or a way of showing the analyst at the time that they're trying to execute a query against a data source: these are the checks that ran, these are any failures that might have happened. That way you can take that back and say, I'm not even going to bother wasting my time on this, because I need to go back to the data engineer and tell them what needs to be fixed before I can actually do my work.
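The two checks mentioned in this exchange, the "one Salesforce opportunity per customer" assumption and duplicates introduced by at-least-once delivery, can be sketched in plain Python. This is an illustrative sketch only: it does not use dbt or Great Expectations, and the field names (customer_id, opportunity_id, event_id) are assumptions made for the example.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the values of `key` that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def dedupe_by_event_id(events, id_field="event_id"):
    """Keep the first occurrence of each event id.

    With at-least-once delivery the same event can arrive more than once,
    so downstream counts need an idempotency key like this one.
    """
    seen = set()
    unique = []
    for event in events:
        if event[id_field] not in seen:
            seen.add(event[id_field])
            unique.append(event)
    return unique


# Hypothetical data: customer c1 unexpectedly has two opportunities.
opportunities = [
    {"customer_id": "c1", "opportunity_id": "o1"},
    {"customer_id": "c2", "opportunity_id": "o2"},
    {"customer_id": "c1", "opportunity_id": "o3"},
]

violations = find_duplicate_keys(opportunities, "customer_id")
if violations:
    print(f"one-opportunity-per-customer check failed for: {violations}")
```

Surfacing the result of a check like this next to the table, rather than leaving the analyst to rediscover it, is the point being made above.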
Benn Stancil
0:32:32
I very much agree with that. I think a lot of time gets sunk into this. There's the common line that data scientists spend 80% of their time cleaning data, and I think that number obviously varies a lot. For folks who are using machine-generated data, event logs and things like that, you don't spend that much time cleaning data. Machine-generated data is not particularly dirty in the sense of, say, having to clean up a bunch of political donation files that are all manually entered. That's dirty data, where you have to figure out a way to take 15 addresses that are all supposed to be the same and turn them into one. Machine-generated data doesn't have that problem, but it has the problem of, can I rely on it? Did it fire these events properly? Did it fire them 10 times when it was supposed to fire them once? Did it miss a bunch? So anything that can help with that side of the cleaning problem, of understanding exactly what I'm looking at and how much my data represents the truth, is valuable. As an analyst, you always have this in the back of your mind: this isn't quite truth, this is the best we have. In a lot of cases I think it's close, but I always have to be a little skeptical that it is truth. So the pieces that can help there are a big value. Another thing that I think data engineers can do for analysts and data scientists is this: engineers have solved a lot of these problems, or have thought about solutions to them, in ways that folks who come up through less typical channels into being an analyst or data scientist haven't. One of the interesting things about data science roles is that people come from all different backgrounds. You'll be working with someone who's a former banker, who is super deep in Excel but is just learning SQL. You'll be working with someone who's a PhD, who has written a bunch of R packages but has never actually used a production warehouse. You'll be working with a former front-end engineer who got into data visualization. You'll be working with an operations specialist who's been writing Oracle scripts for 10 years. There's no consistent skill set for where these folks come from. So the idea of writing something that amounts to a unit test, or the concept of version control with Git, those aren't things that are necessarily going to come naturally to folks in analytics and data science. I think there are places where data engineering can push some of those best practices into the way that analysts and data scientists work.
Data engineers can do it, vendors can do it, there are lots of different ways that we can standardize some of that stuff. But those sorts of practices can definitely come from the engineering side of this ecosystem, to encourage folks to learn the pieces that are valuable for their jobs.
Tobias Macey
0:35:14
Another element of the reliability and consistency aspect, from the analyst's perspective, is understanding when the process of getting that data has changed. You mentioned things like version control. If I push a release to my processing pipeline that changes the way I'm processing some piece of data, and then you run your analysis and all of a sudden you've got a major drop in a trend line and you have no idea why, you're going to spend your time tearing your hair out over, maybe I did my query wrong, or maybe something else happened. It's about being able to synchronize on those changes and making sure that everybody has buy-in as to how data is being interpreted and processed, so that you have confidence that when you run the same query, you're going to get the same results today, tomorrow, and the next day. When that expectation is not met, you start to lose trust in the data that you're working with, and you start to lose trust in the team that is providing you with that data. So making sure that there is alignment and effective communication among teams any time those types of changes come into play is, I think, another way that data engineers can provide value to the analysts.
Benn Stancil
0:36:22
Yeah, and I can give a very concrete example of this. That trust, I think, also extends further. As a data engineer, if your goal is to build an organization that's focused on the products you're building, or that has the mentality that the products you're building matter, you also need to think about that. So as a concrete example, say you have a revenue dashboard. We've worked with a lot of companies on figuring out how much money they make. It seems like it would be the simplest thing, and it's one of the most important numbers a company has, but it's always impossible, and nobody does it well. There are a ton of weird edge cases, and data is coming from tons of different places: it's coming from some system, it's coming from Salesforce, which has all this manually entered stuff. So it's a nightmare of a process just to figure out how much money we made. But say you have a revenue dashboard, and today it says that you made a million dollars last quarter, and tomorrow it says you made one and a half million dollars last quarter. As an analyst, that's your nightmare scenario, because now an executive has seen this and they don't know which number is right. They're mad at you, because why in the world is there this big change? They just told the board it was a million dollars, and now it's a million and a half. Is it going to go to 750 tomorrow? Nobody knows what happened. So you end up having to dig through so many different pieces. Did you write a bad query? Did a sales rep go into Salesforce and backdate a deal that actually signed today? Was there a data entry problem in Salesforce, where somebody put in something wrong? Did you double count something that you didn't mean to double count?
Was there a data pipeline problem where data got updated in a bad way? All of those things become this puzzle of, how do I figure out what happened? Often you don't have any real record of what the system was before. You just know it used to say a million and now it says one and a half, and you have to figure it out. Those are the types of problems that are the headaches for data analysts, and the ones you end up finding yourself in all the time. The more systems you have in place that let you say, oh yeah, it's one and a half because we backdated a deal, or because this other thing happened, the faster you can get to that answer, the easier your job is. But more importantly, the more trust the rest of the organization will have in your work, because you're not spending all this time trying to explain a number and unsure which one you actually want to stand behind.
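One way to keep "a record of what the system was before" is to store a daily snapshot of the rows behind a metric and diff the snapshots when the number moves. The sketch below is a hypothetical illustration of that idea, not a description of any tool mentioned in the episode; the deal ids and amounts are made up to match the one million to one and a half million example.

```python
def explain_revenue_change(before, after):
    """Compare two snapshots of {deal_id: amount}.

    Returns the change in total revenue and a list of
    (deal_id, kind, delta) tuples describing each difference.
    """
    changes = []
    for deal_id in sorted(set(before) | set(after)):
        old = before.get(deal_id)
        new = after.get(deal_id)
        if old is None:
            changes.append((deal_id, "added", new))
        elif new is None:
            changes.append((deal_id, "removed", -old))
        elif old != new:
            changes.append((deal_id, "changed", new - old))
    total_delta = sum(after.values()) - sum(before.values())
    return total_delta, changes


# Yesterday's and today's snapshots of last quarter's deals (hypothetical).
yesterday = {"d1": 600_000, "d2": 400_000}
today = {"d1": 600_000, "d2": 400_000, "d3": 500_000}

delta, changes = explain_revenue_change(yesterday, today)
print(delta, changes)
```

With the diff in hand, the analyst can go straight to the one deal that appeared (for instance, a backdated one) instead of auditing every query and pipeline.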
Tobias Macey
0:38:32
From the perspective of somebody who is working so closely with data analysts and with companies who have data engineering teams, as well as consuming some of these managed services for data platforms, I'm curious what you have seen in terms of industry trends that you're most excited by, from your perspective as an analyst, and some of the things that you are currently concerned by that you would like to see addressed.
Benn Stancil
0:38:59
So, this isn't really a technology, but I think it's one of the places where the industry can go. Generally, the industry has made very big strides in enabling day-to-day data-driven decision making. We've done a lot about how we get data in front of people around the business and how we get it to them quickly. When I'm making a decision as a sales rep about who to call today, or as a marketing manager about which campaigns to focus on, how do I make that decision? I think we've made a lot of progress on that front. One of the places where I think we as an industry can now start to turn and focus more is this: businesses aren't driven by these daily optimizations, and businesses don't win because they made the right daily optimizations. It certainly helps, but the big bets are often the things that determine the winners. Jeff Bezos has this line that in business you want to swing for the fences, because unlike in baseball, where the best you can do on a single swing is score four runs, in business, if you take the right swing, you can score a thousand runs. There's such a long tail of positive outcomes from decisions that it's worth it to make some big bets, because the outcomes of those can be way better than any small optimization. And I think we haven't really had data technologies that focus on helping people figure out how to make those big bets. There's a lot of exploratory work that analysts do to really try to understand what big bet we should make. That's one of the places I'm excited for folks to go next: not just, all right, we are data driven because we have dashboards, we are data driven because everybody's able to look at a chart every day, but how do we become data driven about the big bets?
And that, I think, is really about how we enable data analysts and data scientists to answer questions more quickly and to explore which big bets might work out. Ultimately, the way I think you win on making big bets is by being able to make more of them and making them smarter. And the way you can do that is basically to research these problems more quickly and understand what might happen if you make these changes. Those aren't the things that a dashboard will tell you. Those are the things that in-depth analysis will tell you. So as an industry, I think that's one of the places we can go next: not just enabling the "how do we optimize small things," but how do we uncover the really big opportunities that are still very much a boardroom type of conversation these days?
Tobias Macey
0:41:31
And are there any other aspects of the ways that data engineers and data analysts can work together effectively that we didn't discuss yet that you'd like to cover? Before we close out the show?
Benn Stancil
0:41:42
I think it mostly comes back to knowing your customer and knowing the problems they're trying to solve. It's really getting into knowing exactly what it is that they're trying to do and how they use the products that you build. Again, I think learning, not from engineers, but from product managers and designers, is a super valuable thing, because those are the problems they solve: learning from their customers. And I think data engineers can, in a lot of ways, do the same thing. One other thing, and this isn't really about engineers but about how organizations can think about data engineering today, is: don't hire folks too early. Tristan Handy, I think, wrote a blog post about this that focused on the idea that a lot of organizations will think they need a data engineer before they do. With all the tools out there now, with the ETL tools, with point-and-click warehouses that are super easy to set up, and with how well these tools scale, it often isn't necessary. Stitch and Fivetran and those sorts of tools can scale to pipelines that are plenty big for most companies. Snowflake or BigQuery or Redshift can scale to a database that's plenty big for most folks. Your data engineering problems are often not going to be that interesting or complex. They're problems that are important to have someone own, but that person often doesn't need to be a dedicated data engineer. I think companies can hire a data engineer too quickly, because it's a role they think they need, but the data engineer ends up basically being an administrator of a bunch of third-party tools. And that's not a role that a data engineer wants. An engineer wants to solve hard problems and work on interesting stuff. They don't want to be someone who's checking to make sure that Stitch is running every day, or that your Airflow pipelines ran: check the dashboard, yep, Airflow ran again, looks good. That's not that interesting of a problem. So I think it's about being honest about what the role is and where you need people to come in, before you actually hire someone who thinks they're coming in to figure out how to scale this Spark cluster to something huge, when in reality they're just checking stuff like your Snowflake credits every day and saying, okay, we're still using them at the same rate. So I think that's kind of a big shift: you can get pretty far with the out-of-the-box stuff now.
Tobias Macey
0:44:02
So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Benn Stancil
0:44:18
So I think one is the piece we've talked about: how do you monitor pipelines? Not monitoring in a strict DevOps sense, but monitoring in the sense of knowing, when I have a question and I see something out of place, that I can very quickly tie out whether it was because the data changed, because some assumption I made got invalidated, because a data pipeline didn't work, or because a pipeline ran in an order that was unexpected. All those sorts of things are super valuable, and they save analysts tons of time from having to dig through the weeds of these problems. There's another place where I think we're starting to see some movement, but where we still don't have a really solid solution, which is a centralized modeling layer. When you think about how data gets used around an organization, it's not consumed by one application. As a business, I have a bunch of data. Typically folks can now centralize that data into data lakes or warehouses, putting it all into S3 with Athena on top, or Snowflake, or whatever. But then you have to consume it. Say you want to model what a customer is. That's a traditional BI type of problem, but most BI models only operate within the BI application. Because data is now spread so much around an organization, the model of what a customer is needs to be centralized. It needs to be available to engineers who are using APIs to pull data out programmatically to define something in the product. It needs to be available to data scientists who are building models on top of it to forecast revenue or build in-product recommendation systems. And it needs to be available to an executive who's looking at a BI tool to understand how many new customers we have every day. All of these different applications require this centralized definition of what a customer is. A tool like dbt is moving in the right direction, but there's still not a great way to unify concepts like that within a data warehouse in a way that is consistent. So, as someone consuming that data, you end up having to rebuild a lot of these concepts in different ways, which creates all sorts of problems where what is a customer in one place isn't quite what's a customer in another. I don't think we've quite figured out that part of the layer. We have the warehouse layer, and there are very robust applications that can sit on top of the warehouse, but they all feed off the warehouse through different channels and through different business definitions of what this data means.
And without that centralized layer, you're always going to have some confusion over these different definitions.
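The "centralized definition of a customer" idea can be illustrated with a tiny Python sketch in which every consumer imports the same predicate instead of re-deriving it. The business rule here (a paid plan and not an internal account) and all field names are hypothetical, chosen only to show the shape of the approach; in practice this definition would more likely live in a shared model in the warehouse.

```python
def is_customer(account):
    """Single shared definition of "customer" (hypothetical rule):
    a non-free plan that is not an internal test account."""
    return account["plan"] != "free" and not account["is_internal"]


def count_customers(accounts):
    """What a BI dashboard would report."""
    return sum(1 for account in accounts if is_customer(account))


def customer_features(accounts):
    """What a forecasting model would train on: same population, different view."""
    return [
        {"account_id": account["account_id"], "seats": account["seats"]}
        for account in accounts
        if is_customer(account)
    ]


# Made-up accounts: one paying customer, one free user, one internal account.
accounts = [
    {"account_id": 1, "plan": "pro", "is_internal": False, "seats": 5},
    {"account_id": 2, "plan": "free", "is_internal": False, "seats": 1},
    {"account_id": 3, "plan": "pro", "is_internal": True, "seats": 50},
]

print(count_customers(accounts), customer_features(accounts))
```

Because both consumers call is_customer, the dashboard count and the model's training population cannot silently drift apart when the definition changes.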
Tobias Macey
0:46:48
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Mode and your experiences of working with data engineers and trying to help bridge the divide between the engineers and the analysts. It's definitely a very useful conversation and something that everybody on data teams should be thinking about to make sure that they're providing good value to the business. So I appreciate your time, and I hope you enjoy the rest of your day.
Benn Stancil
0:47:13
Thanks, guys. Thanks for having me.
Tobias Macey
0:47:20
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.