Summary
The Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In addition to that, the host curated the essays contained in the book "97 Things Every Data Engineer Should Know", using the knowledge and context gained from running the show to inform the selection process. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering. He also provides some advice for those who are early in their career of data engineering and looking to advance in their roles.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines gives you the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m doing something a bit different. I’m going to talk about some of the lessons that I have learned while running the podcast, compiling the book "97 Things Every Data Engineer Should Know", and some of the themes that I’ve observed throughout.
Interview
- Introduction
- How did you get involved in the area of data management?
- Overview of the 97 things book
- How the project came about
- Goals of the book
- What are the paths into data engineering?
- What are some of the macroscopic themes in the industry?
- What are some of the microscopic details that are useful/necessary to succeed as a data engineer?
- What are some of the career/team/organizational details that are helpful for data engineers?
- What are the most interesting, innovative, or unexpected outcomes/feedback that I have seen from running the podcast and working on the book?
- What are the most interesting, unexpected, or challenging lessons that I have learned while working on the Data Engineering Podcast and 97 things book?
- What do I have planned for the future of the podcast?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- 97 Things Every Data Engineer Should Know
- Buy on Amazon (affiliate link)
- Read on O’Reilly Learning
- O’Reilly Learning 30 Day Free Trial
- Podcast.__init__
- Pipeline Academy data engineering bootcamp
- Hadoop
- Object Relational Mapper (ORM)
- Singer
- Airbyte
- Data Mesh
- Data Contracts Episode
- Designing Data-Intensive Applications
- Data Council
- Data Engineering Weekly Newsletter
- Data Mesh Learning
- MLOps Community
- Analytics Engineering Newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm doing something a bit different. So instead of having a guest on the show, I'm actually going to be talking about some of the lessons that I've learned while running the podcast, my experience working on the book 97 Things Every Data Engineer Should Know, and some of the lessons and themes that I have observed throughout all of that. I'm sure most of you know me as the host of the Data Engineering Podcast, but just to give my official introduction, which has probably leaked out in bits and pieces over the various episodes over the past five years: I am the host of this show, and I've been running the Data Engineering Podcast since 2017.
Before that, I started the show Podcast.__init__, focused on Python and its ecosystem. And my background is actually in a combination of systems administration and software engineering. So I got my degree in computer engineering, which is sort of a hybrid of electronics and software, so electrical engineering meets computer science. And I started as a systems administrator and then moved into software engineering and then ended up settling kind of in the midpoint as a DevOps engineer. These days, I run the platform and DevOps team for Open Learning at MIT, which gives me an opportunity to work at the boundaries of software and systems and data. And so I'm actually working through the process of building out my own data platform and using a lot of the lessons that I've learned from the podcast and from my guests to make some architectural choices about how that will best work given our constraints and our operating environment and some of the goals that we have for that platform. So this podcast has served as a valuable learning experience for myself, as I know it has for many of the people who listen to it.
And so my background in data management is actually through that work of being a systems administrator and software engineer and managing a lot of the data and persistence mechanisms that go into these various systems that I've worked on. Some of the most data-heavy work that I've done was actually for a job where I was doing a lot of work with BigQuery, trying to use it as the persistence mechanism for a clickstream data flow, basically, where we had a homegrown API server that would take JavaScript events from a web application, push them into a Redis queue, and then batch those up into BigQuery so that we could then build the user-facing analytics on the events that we were tracking. And so it really wasn't the best tool for the job, but we were able to make it mostly work. So a lot of the things that I learned about data came from that, as far as useful patterns, antipatterns, and choosing the right storage engine for the job. So definitely some useful lessons learned there. And that's also part of what fed into my interest in data engineering as a formal position and role, and a lot of the computer science principles that go into it.

And so as I mentioned, I've been running the podcast for about five years now, and I think it was maybe two or three years ago at this point, so I had been running the show for about three years, that I was approached by the folks at O'Reilly to see if I was interested in taking over a project that they were doing called 97 Things Every Data Engineer Should Know. And so that was actually an entry in the series that they've done with that general theme of "97 things every blank should know". So they've had programmers, cloud engineers, and in this case, data engineers. So they asked me if I was interested, and it seemed to fit well with what I was doing with the podcast, so I took on the role. And the goal of the book is to be able to give people a good overview of some of the lessons and principles that go into data engineering and some of the things that you need to know as you're coming into the role, or, if you are already working as a data engineer, maybe some lessons that you haven't had the opportunity to come across on your own. And so it's a collection of short essays by folks who are working in the industry in various roles, sharing some of the tidbits that they've learned and trying to present it in a concise format to give you the idea, but not going deep. So each article is about a page and a half to two pages, and it covers various macroscopic and microscopic trends in the industry or details that you might want to know. And so I was working on communicating with folks who I've had on the podcast, asking if they were interested. I made some announcements both through the podcast and the email newsletter about the fact that I was working on this project to invite people to send me their essays to include.
And so it was about a year-long project of communicating with people, collecting all these entries, and then reading them, sorting through them, figuring out which ones really kind of capture the essence of data engineering without having too much overlap. So it was an interesting project, and I learned a lot of things in that process as well. And the finished book actually got released in the middle of fall of last year, so it's been out for a little while now. It's been getting some pretty good reception. I've had some folks reach out to me to say that they've read it and enjoyed it and been able to learn some things from it. And I just wanted to kind of recap and review some of the details that went into the book and some of the interesting lessons and trends and themes that I noticed as a result of that.
And so starting out with the question of what is a data engineer and how do you get into data engineering, some of the things that go into the book, and also things that I've learned from the podcast, are that there isn't any one path into data engineering, particularly a few years ago. These days, it's becoming a bit more direct, where there are actually boot camps where folks will train you in some of the principles of data engineering, because of the fact that there is such a growing need for people to be able to manage the wealth of data that organizations are collecting and trying to use for analytics and machine learning projects. And so there are some more formalized ideas of what data engineering requires. But in terms of the backgrounds of people who get into data engineering, it's everything from software engineers who are really interested in the data processing and the analytical aspects of it, so they want to get more into the systems level. There are infrastructure engineers who end up being tasked with handling some of the underlying storage engines and processing engines: people who came from the Hadoop ecosystem of needing to be able to deploy and maintain these large clusters of machines, or people who are working with Kafka or even just relational database engines like Postgres.
And they are really interested in the reliability aspects of building these infrastructure components that power the analytical capabilities for their organizations. And so they decide that they want to move a layer up from the bare metal and the deployment of those systems into the actual operation of those systems. There are also a lot of folks who get into it from the data science side, where they maybe come in as the first data scientist on a team or in an organization, and so they then end up having to build up all of the processing and cleaning before they can do what they were hired to do. And then there are also a number of data analysts who maybe start with the analytical knowledge of how the data is being used and then decide that they wanna get more into the details of how that data is collected and managed and cleaned.
And so there isn't any one path into data engineering, and even more so than with software, it seems like folks are coming from a number of different backgrounds, because data is very pervasive. And even if you're not dealing with data from a computational perspective, there is still a lot of interest and draw for folks who are curious about how the data can be used and what they can do with the data, and so they get pulled into this area. Some of the ideas that come into the book about getting into data engineering are some of the categorizations of types of data engineers. So there's one article that draws a distinction between the data engineer who works at a higher level of understanding what questions are being asked of the data, treating everything as a declarative recipe for getting at that information, cleaning it up, and preparing the data, and then the software-focused data engineer, where you're building more complex systems, doing detailed processing of the information, maybe feeding that data into other downstream systems, and building out these complex pipelines to be able to work with machine learning use cases or analytical use cases. And so there are different requirements and backgrounds that feed into both of those, so that was a very interesting way of kind of dividing the types of data engineers.
Some of the other things that go into the kind of career aspect of working in data engineering is the idea of treating the data as the focal point of the process and not trying to put as much emphasis on the actual technical and software components of the system. So working with the other folks on your team and across your organization to help them understand all of the different ways that the data is being used and the different processing that's happening on it, and feeding back some of the usage patterns, exposing that so that people can see, oh, okay, this person in this department is using this set of tables to be able to answer their questions, maybe I can take advantage of that. So building a community in your organization around the data and how it's used, and not spending so much of your emphasis on the technical elements of what is doing the processing and how the data is being provided. Another aspect of data engineering and the data ecosystem that is, I think, unique in the space is that there are a lot more cross-cutting concerns that go into it. Working as a software engineer, you are building an application, and so there's a much more contained aspect to it, where you might have a product team that provides you with the requirements or the feature requests.
But as a software engineer, you can live in this entire ecosystem of the application. You know, with microservices, it's maybe multiple applications, but it's still a software system. With data, you need to have folks who are spanning the technical elements of how is the data generated, how is it collected, how is it stored, how is it processed, how is it managed, and then the analysts who are trying to understand the context around the data. So how did this data get produced? Who produced it? When was it produced? Why was it produced? What is the goal of this dataset? How can I use this to answer questions that the business people are asking me? And then you have the people in the business, whether that's the C-suite executives who are trying to figure out what direction to take the business, or salespeople who are trying to understand what are the patterns of our industry and how can I understand more about the customers that I'm working with, and similarly with marketing. And so you have all of these people that are all oriented around the data, because of the fact that so many organizations rely on this to be able to know what is actually happening in the business, because businesses are becoming more complex and more multifaceted. There's more data available to work with, and so everybody needs to have their hands in this process.

And so as data engineers, we need to be able to provide the information that these people need beyond just the raw data points. We need to be building an ecosystem of the data and the context, and helping people be able to answer these meta questions beyond just how many widgets did I sell this past quarter, but who did I sell them to, what were the sort of cycles of sales, what were some of the other kind of macroscopic and macroeconomic elements that were going on. So there's no real stopping point with data. Sometimes with software systems, a problem can be well scoped and you say, okay, I've written all the software, this project is done. You know, it does everything it's supposed to do, and barring bit rot, there's not really anything that needs to be done about it anymore. With data, there are always gonna be additional follow-on questions. So there are always gonna be opportunities to bring in additional data sources, add additional context, enrich the data, find new ways of accessing the data. So that's another thing: you might put your data into a data warehouse because that serves the need of your analysts, but then maybe for your machine learning engineers, you need to also have it available in a data lake to be able to pull in unstructured data, or maybe you need to merge data across multiple different storage locations. So there are a lot of complexities that come into the space.
And so moving into some of the macroscopic elements of data engineering and the data industry, some of the things that are discussed in the book are the continued dichotomy of batch systems and streaming systems, where for a long time we didn't have a lot of the capacity to do large-scale streaming analytics because the technology wasn't there yet, but that is increasingly not the case, where we have a number of different streaming engines. But batch systems are still much easier to reason about; they're more intuitive to think about how they work. And so there is still this trade-off of complexity in terms of the technologies and in terms of the paradigms: do I want to build this in a batch mechanism where maybe I'm gonna put everything into the data warehouse, or do I want to do this in a streaming approach where I want to be able to have continually updated, real-time information about a particular aspect of the business or the customer engagement?
And so there are some articles that talk about when to use batch and when to use streaming. There are articles that dig into some of the specifics of streaming messaging patterns, so talking about the data contracts and making sure that you have schematized elements so that everybody who is working with the data knows what the structure is going to look like and knows how they're able to use it, and then being able to have that data land in a data lake or a data warehouse and have the appropriate structure around it. Because data lakes are definitely very useful because they can be very flexible, but that flexibility also adds a certain amount of extra upfront requirements to make sure that you know what the data is going to be used for and why, so that it can be structured appropriately. Otherwise, it becomes a dumping ground and becomes useless. So you need to have data cataloging. That's another macroscopic trend that's been coming up a lot in recent years: building out data catalogs and building out metadata management so that there's this discoverability and visibility element of the data that you're working with, and that actually spans data lakes and data warehouses.
On the batch side of things, the cloud data warehouses have been seeing a lot of activity and attention because of the expanded capabilities and because of the additional flexibility in terms of the processing and the cost models, where data warehouses used to be these large appliances that were put into a data center, and you had to understand what is my capacity going to be for the next five years so that I know I'm buying enough hardware to handle the maximum use case for however long this is going to be in service. Whereas now, it's a pay-as-you-go model where you can say, I'm going to start with a small data mart, maybe I'm going to use a Snowflake or a Redshift or a BigQuery, and I can add capacity dynamically as I go. I don't need to pre-provision all of that.
There's actually a great article that talks about the contributing factors for understanding what data warehouse to use. So what are you going to need it for? How long is it going to be in service? Because there is a substantial switching cost if you say, okay, today I'm going to try out Snowflake, and then six months from now I realize that it's actually not the right fit for my business and now I need to migrate to BigQuery. So there's a substantial cost both in terms of engineering time but also in terms of the actual financial cost of migrating your data between these systems. So one of the main themes that I've seen both in the book and in the podcast is really this importance of having a good upfront understanding of the use cases that you're trying to power and the design that will provide that. So with software systems such as a web application, you could be very iterative in the development, where you say, okay, I'm going to start with this one form that I can use as input, and then I'm going to add additional functionality, where next week maybe I'll add a dashboard that shows all the inputs to this form. Whereas when you're working with large data systems and complex data systems, you need to have a much more detailed understanding of what is the data that I'm collecting, who is going to be using that data, what transformations or processing do I need to do on that data, and what is the format and the storage location?
What are the boundaries in terms of the technical as well as the organizational conditions of when this information is going to be moving either between systems or between teams? You need to have these well-defined specifications of how the data is going to be stored and used so that you don't end up having to re-engineer things and spend a lot of time and effort after you've made the discovery. Maybe you say, I'm going to put everything into a data lake as a bunch of JSON files, and I'm not going to enforce the structure upfront because I'm still exploring the problem space. And that might be okay for a very small, well-scoped discovery period, but if you let that go on too long, then you're going to end up having to spend a lot of time and effort once you do understand what your use cases are going to be, writing pipelines that will actually reprocess all of that data, figure out how to handle mismatched schemas and mismatched data types, and do a lot of extra cleaning that could have been avoided if you had spent a bit more time at the beginning having conversations with people in the business and people on your team about what the use cases are that you're trying to power. So there's a lot of importance in this very deliberate approach to designing and implementing your systems. It's possible to do it iteratively and ad hoc, but it's going to cost you a lot of extra time down the road.
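To make that point about enforcing structure upfront a bit more concrete, here is a minimal sketch of validating events against a declared schema before they land in the lake. This is my own illustration rather than anything prescribed in the book; the event shape, field names, and file paths are hypothetical, and it assumes the jsonschema library is available.

```python
# Minimal sketch: validate incoming JSON events before landing them in the lake.
# The schema, field names, and paths are hypothetical examples.
import json
from jsonschema import ValidationError, validate

ORDER_EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "amount_cents": {"type": "integer", "minimum": 0},
        "created_at": {"type": "string"},
    },
    "required": ["order_id", "customer_id", "amount_cents", "created_at"],
    "additionalProperties": False,
}

def land_event(raw: str, valid_path: str, quarantine_path: str) -> bool:
    """Validate one raw JSON event and route it to the lake or to quarantine."""
    record = json.loads(raw)
    try:
        validate(instance=record, schema=ORDER_EVENT_SCHEMA)
    except ValidationError as err:
        # Malformed events are quarantined for inspection instead of silently
        # turning the lake into a dumping ground of mismatched shapes.
        with open(quarantine_path, "a") as fh:
            fh.write(json.dumps({"error": err.message, "record": record}) + "\n")
        return False
    with open(valid_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return True
```

The same idea scales up to schema registries and contract tests in real pipelines; the point is simply that structure gets checked at write time rather than discovered painfully at read time.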
StreamSets' DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.
Finally, one single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' Professional Tier will receive two months free after their first month.

Some of the microscopic details, or some of the implementation specifics, that are interesting, and that some useful articles in the book talk about, are the ways that the data is actually laid out on disk and how that can impact processing times and latencies and the efficiency of your workflows. So there are some types of systems that actually do very well with lots of small files, but a lot of times it's actually much better to have a fewer number of larger files. So figuring out what is the appropriate granularity to trade off processing efficiency with latency, or processing efficiency with scale of the data.
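As a rough illustration of that small-files trade-off, here is a sketch of compacting a directory of many small Parquet files into one larger file with pyarrow. The paths and row-group size are assumptions made for the example, not recommendations from the book.

```python
# Rough sketch: compact many small Parquet files into one larger file.
# Paths and the row-group size are illustrative assumptions.
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Treat the directory of small files as one logical dataset and materialize it
# (fine for a sketch; a very large table would be rewritten in chunks instead).
small_files = ds.dataset("landing/events/", format="parquet")
table = small_files.to_table()

# Rewrite as a single larger file, which most query engines scan far more
# efficiently than thousands of tiny files.
pq.write_table(table, "compacted/events/part-0000.parquet", row_group_size=1_000_000)
```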
There are also some useful lessons about patterns to use for working with distributed messaging systems such as Kafka or Pulsar, and how to think about structuring your data streams in a way that downstream consumers can make effective use of them without having to do a bunch of re-engineering of your events. There are also some good lessons about the specifics of different data persistence models, such as relational engines versus non-relational engines, so SQL versus NoSQL if you will, as well as strongly consistent versus eventually consistent systems and what the trade-offs are there. Strong consistency means that I know that when I write this data, I can read it back immediately and know that it will have the same information. Whereas if I have a system that has a high volume of writes, I need to make sure that I can accept all of those writes, but I don't necessarily care about being able to immediately read those back consistently. I'm okay with them being eventually consistent; I know that eventually everything will coalesce to a steady state, but at the point where I'm writing all of this data, I'm okay with having to do some reconciliation afterwards. There are also some useful articles talking about some of the considerations that go into defining and implementing these data contracts that I was referring to, with the upfront design that's necessary.
So one of them actually has some very useful advice about using a versioned schema for being able to enforce those contracts. So using something such as Avro or protocol buffers in the producing system to ensure that events are only ever able to be emitted in a well-formed and schema-defined manner, and being able to have those schemas be evolvable so that you can ensure backwards compatibility with older events as you're working through these systems.

Another interesting point, both a microscopic detail in terms of how it's implemented and a macroscopic trend that I'm seeing, is the recognition that building manageable and scalable data systems is the responsibility of not just the data engineers, but also application engineers. So starting to push some of these principles of well-structured events and properly formatted data into the application layer, so that as data engineers we are not tasked with getting a direct connection to an application database, trying to figure out what all these tables are supposed to be telling me, and understanding how those tables might be related to each other, particularly if you're dealing with an application that's using an object relational mapper that might not actually create the appropriate foreign key references in the database schema because they're implied in the code. So rather than having to dig into the guts of these databases and manage the schema evolutions that the application engineers need, but that you don't necessarily understand as the consumer of that data (why did this table lose a column, why was this column renamed, or what is the new requirement of whether this column is null or not null or sometimes null), you can instead say to the application developers that part of the requirements of your application is that you are going to provide me with an interface to consume the data that I should care about. So I don't know what's going into this application, I don't necessarily know how it works, but you as the application engineer understand what are the important pieces of information and what are the ways that this data might be useful outside of the context of the application, and so you provide an interface to be able to consume that data in a stable manner. So using these protocol buffers or Avro schemas to create these APIs so that I can just query a stable endpoint to get this data out, maybe using something like a Singer tap or an Airbyte consumer to be able to pull data out of this application without having to go into the database layer to get it. So I think that that's really a valuable evolution of the industry, where data is becoming a first-class consideration, where analytical uses of data are becoming a first-class consideration in the structure of the applications that we build, so that we don't have to do this re-engineering every time we wanna pull something out of some system that produces the data in the first place.
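To circle back to that versioned-schema point, here is a small sketch of what an evolvable Avro schema can look like, using the fastavro library. The record and field names are made up for illustration, and this is just one way to express the pattern, not the specific approach any particular essay prescribes.

```python
# Sketch of an evolvable, versioned Avro schema for a data contract.
# Record and field names are hypothetical.
import io
from fastavro import parse_schema, reader, writer

# Version 1 of the contract: what the producer originally emits.
SCHEMA_V1 = parse_schema({
    "name": "PageView",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
    ],
})

# Version 2 adds an optional field with a default, so data written under v1
# remains readable by consumers expecting v2 (backwards-compatible evolution).
SCHEMA_V2 = parse_schema({
    "name": "PageView",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
writer(buf, SCHEMA_V1, [{"user_id": "abc123", "url": "/home"}])
buf.seek(0)

# Reading old records with the new reader schema fills in the default value.
for record in reader(buf, reader_schema=SCHEMA_V2):
    print(record)  # {'user_id': 'abc123', 'url': '/home', 'referrer': None}
```

Protocol buffers express the same idea with numbered and reserved fields; the common thread is that compatibility is enforced by the schema system rather than by hoping downstream consumers notice the change.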
Another element of the kind of data contracts and the evolution of the ways that data is transformed and produced is the data mesh. So there have been a number of conversations about data mesh as a way to say: this is the bounded context of where this data is being produced, this is the contract that I'm going to provide of what this data looks like and some of the semantics around it, so that downstream consumers can use it without having to do a bunch of reprocessing. So I think that that's another interesting and useful evolution of the industry, and some things that are covered in the book as well. And I think that that also factors into a few of the organizational trends, where there's the growth of analytics engineers as a recognized role, where they're working more closely with the business to be able to produce these analyses, but also being responsible for handling the cleaning and transformations of the data so that they can access it in the ways that are most conducive for answering the questions that they're being asked, as well as the growth of machine learning engineers as people who are dedicated to building the machine learning operations and making these machine learning workflows more repeatable. And just in general, there's been a lot more specialization because of the growth of data. But at the same time, as we get more sophisticated, I'm seeing a bit of a reverse trend also, where some organizations say we don't actually need specialized definitions of roles for data infrastructure engineer, data engineer, machine learning engineer, MLOps engineer, analytics engineer; we just need engineers who understand data, and so it's becoming maybe a bit more homogeneous.
So there's kind of this interesting dichotomy of specialization and generalization, and it really depends on where you are in your career, where your organization is in its maturity level, and some of the ways that you're using data. So I've definitely seen folks who are starting to move in the opposite direction of saying, we just have engineers, it's everybody's job to care about the data, and that's part of what the data mesh trend is moving towards. But also people who are saying, we're using data all over the place, so we need analytics engineers and MLOps engineers and data engineers, etcetera, and data platform engineers and data infrastructure engineers.
So, interesting times, to be sure. And in all of that, one of the things that has remained the same and continued to gain in importance is actually the role of automation in making all of this manageable, where we have data pipelines that are tasked with automating, you know, the extract and the loading and the transformation of data. There is automation at the infrastructure layer for being able to dynamically scale capacity, whether that's through a vendor such as Snowflake, or if you're using infrastructure as code for your cloud resources, or being able to manage public cloud versus private cloud versus hybrid cloud. So it's really a rapidly expanding and constantly fractal space that we're working in.
And so there's definitely a lot to know, but at the same time, you can get a lot done with knowing just enough. So I definitely don't want folks to be overwhelmed with everything that's happening and feeling like you're never going to know everything, because none of us know everything. Nobody's ever going to know all there is to know about data. But as long as you're able to grasp the kind of foundational and fundamental principles, you'll be able to figure out the rest of it as you go. So you really just need to say to yourself, what is it that will help me today or tomorrow with being able to be more effective at handling this one piece of the way that data is used, either for my own personal projects or in my organization, and then use that as an opportunity to learn a little bit more. So definitely don't try to learn everything all at once, but instead try to practice the principle of just-in-time learning: I know that I need to get this done today or tomorrow, so I'm going to learn a bit more about this. And then that will open up new avenues to explore as you go, and you'll always be able to make forward progress.
In terms of my work on the podcast and on the book, I started this podcast partly because I just wanted to learn more about the space. So I, as I said at the beginning, have experience in software development, systems automation, and systems administration. And I knew some of the principles of data engineering because I had built some pipelines, but I really wanted to have the opportunity to learn from the experts and help share that knowledge through the podcast. And so I've had a really great opportunity to be able to make that happen. And in the past five years, I've gained a lot of knowledge personally. And as a result, I've started to be viewed as an expert in the field because of the podcast.
And so some of the most interesting and innovative and unexpected outcomes that I've seen from running the podcast and working on the book, and some of the feedback that I've gotten, is that I never thought that I would be this far along in my journey as a data engineer, that I would have been able to provide so much information and knowledge to the community, or that the podcast would ever get to be as popular as it is now. I've actually had some folks who have written to me to say that they were interested in data engineering, and through listening to the podcast and learning the lessons that way, they were actually able to break into the industry and get a job as a data engineer because of the things that they learned through the podcast. I've also had people write to tell me that they've been working in data engineering, or maybe they manage a data team, and because of the things that they've heard about in the podcast or lessons that they've learned, they've been able to make substantial improvements to the way that they manage data in their organization. So it's definitely very humbling and gratifying to have been able to provide such a resource to so many people. In terms of the interesting and unexpected and challenging lessons that I've learned while working on the podcast and the 97 Things book, I didn't know at the outset how much detail there is and how wide this ecosystem can be, with so many different considerations ranging from, you know, just the pipeline design, ETL, databases, data lakes, into metadata management, data governance, data privacy, and data security.
So I've definitely learned a lot there, and I've also learned a lot about how to evaluate the utility of different products or services that are out there, because I've had so many people reach out to me saying that they wanna be on the show to talk about various things, or I've come across different things in my own work and tried to understand, is this useful, or maybe this is something that's worth exploring on the podcast. So it's really about being able to have a useful application of judicious skepticism: not taking claims at face value, but knowing how to look to see how well this particular vendor or product is addressing these fundamental elements of data challenges or organizational challenges.
And when is it something new and novel versus when is it just a repackaging of something that's already been done before? So how do I figure out whether to talk about five different data quality tools because they're all taking it in a different direction, versus only talking to one or two of them because there are only one or two kind of novel ideas in the space? There are definitely a lot of novel ideas in the space of data quality, so I don't wanna suggest that it's not nuanced and detailed, but I'm just using that as a sort of top-of-mind example. And I was also very grateful for the opportunity to work on the 97 Things book, both as a way to have the opportunity to reach out to folks in the community to get some more detail from them about things that they're working on, as well as a way to provide a resource that people can use to get a broad, surface-level view of a lot of the things that are happening in the space.
So for anybody who is interested in breaking into the industry or learning more about some of the foundational principles or figuring out what are the topics that are worth exploring, I think the book does a good job of that. And I'll also say that if you really wanna get a solid foundational introduction to a lot of the computer science and systems design principles that go into all the systems that we work on, I highly recommend reading the book Designing Data-Intensive Applications, because it is a fantastic resource that has an appropriate level of detail on all of the different computer science and distributed systems concepts that go into the things that we're working on.
There are lots of other great resources out there. So there are some useful Medium blogs. I definitely recommend checking out what the folks at Data Council are doing. They've got a great community. You know, they do a lot to help further the ecosystem and work with both open source projects and newcomers to the community, as well as helping businesses kind of reach their audience and understand what are the challenges that people in the trenches are going through every day. And there are also some other communities that are starting to grow up. There's one that's growing up around data mesh; I think it's called Data Mesh Learning.
There's an MLOps Community that's been gaining a lot of ground. So I'll add links to all these in the show notes. And so I definitely just wanna thank everybody for giving me your attention and taking the time to listen to the shows that I put out. I'm very happy that I've been able to provide something that is useful. And so with that, I'm going to add my contact info to the show notes. It's already on the website, but that's for folks who maybe wanna follow up. Definitely, if you liked this format of having me go on a monologue and talk about some of the things that I've been seeing, let me know. If there are any particular topics that you want me to dig deep into from what I've seen or what I've worked on, just send me a message. I'm happy to try this out again. And so as my final question to myself, I'd like to share my perspective on what I see as being the biggest gap in the tooling or technology that's available for data management today. And I think that right now, it's actually in this space that I mentioned earlier of the kind of split between software and application development and data engineering and data platforms, where we're starting to evolve to this space where data and stable interfaces are a first-class consideration of applications, but the tooling is not quite there to make it easy for application developers to be able to say, okay, this is what I want to expose. We have good resources to define database models or to define APIs, but it's not straightforward and out of the box to say, okay, these are the things that you need to know about how to build an API that is useful for pulling out analytical data, whether that's in batch format, or being able to do event publication so that I can maybe feed that into an event bus, or being able to do some, like, change data capture style approach from that stable interface, so that I can just, as a data engineer or as a data platform, consume those events incrementally.
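To give a flavor of what such a stable, analytics-friendly interface could look like, here is a purely hypothetical sketch of an application exposing its changes incrementally, keyed off an update timestamp, so the data platform never has to reach into the application's private schema. The table, columns, and consuming helpers are all invented for the example.

```python
# Hypothetical sketch: an application-owned, incremental extraction interface.
# Table name, columns, and the consuming helpers are invented for illustration.
import sqlite3
from typing import Dict, Iterator

def changed_orders(db_path: str, cursor_value: str, batch_size: int = 500) -> Iterator[Dict]:
    """Yield orders updated after the given cursor value, oldest first.

    The consuming pipeline remembers the last `updated_at` it saw and passes it
    back on the next run, so extraction is incremental rather than a full-table
    scan against whatever the application happens to store internally.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(
            "SELECT order_id, status, amount_cents, updated_at "
            "FROM orders WHERE updated_at > ? ORDER BY updated_at LIMIT ?",
            (cursor_value, batch_size),
        )
        for row in rows:
            yield dict(row)
    finally:
        conn.close()

# A consuming pipeline might then do something like (helpers are hypothetical):
#   last_cursor = load_saved_cursor()
#   for record in changed_orders("app.db", last_cursor):
#       emit_to_warehouse(record)
#       last_cursor = record["updated_at"]
#   save_cursor(last_cursor)
```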
So there's definitely a lot of opportunity to be able to build tooling and systems that make it easier for people who are working with these, you know, web frameworks or application frameworks to create these analytical interfaces without having to re-engineer it from scratch every time. And so with that, I definitely wanna thank everybody for listening. I'll add some links to the show notes with useful references. I appreciate everything that everybody has done to help get this show to where it is today. Thank you, and have a great rest of your day. Thanks for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Background
Lessons from Running the Podcast
97 Things Every Data Engineer Should Know
Pathways into Data Engineering
Batch vs. Streaming Systems
Data Catalogs and Metadata Management
Importance of Upfront Design
Implementation Specifics and Data Contracts
Data Mesh and Organizational Trends
Role of Automation in Data Management
Unexpected Lessons from the Podcast and Book
Recommended Resources
Biggest Gap in Data Management Tooling
Closing Remarks