Summary
The Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In addition to that, the host curated the essays contained in the book "97 Things Every Data Engineer Should Know", using the knowledge and context gained from running the show to inform the selection process. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering. He also provides some advice for those who are early in their career of data engineering and looking to advance in their roles.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines gives you the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m doing something a bit different. I’m going to talk about some of the lessons that I have learned while running the podcast, compiling the book "97 Things Every Data Engineer Should Know", and some of the themes that I’ve observed throughout.
Interview
- Introduction
- How did you get involved in the area of data management?
- Overview of the 97 things book
- How the project came about
- Goals of the book
- What are the paths into data engineering?
- What are some of the macroscopic themes in the industry?
- What are some of the microscopic details that are useful/necessary to succeed as a data engineer?
- What are some of the career/team/organizational details that are helpful for data engineers?
- What are the most interesting, innovative, or unexpected outcomes/feedback that I have seen from running the podcast and working on the book?
- What are the most interesting, unexpected, or challenging lessons that I have learned while working on the Data Engineering Podcast and 97 things book?
- What do I have planned for the future of the podcast?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- 97 Things Every Data Engineer Should Know
- Buy on Amazon (affiliate link)
- Read on O’Reilly Learning
- O’Reilly Learning 30 Day Free Trial
- Podcast.__init__
- Pipeline Academy data engineering bootcamp
- Hadoop
- Object Relational Mapper (ORM)
- Singer
- Airbyte
- Data Mesh
- Data Contracts Episode
- Designing Data-Intensive Applications
- Data Council
- Data Engineering Weekly Newsletter
- Data Mesh Learning
- MLOps Community
- Analytics Engineering Newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm doing something a bit different. So instead of having a guest on the show, I'm actually going to be talking about some of the lessons that I've learned while running the podcast, my experience working on the book 97 Things Every Data Engineer Should Know, and some of the lessons and themes that I have observed throughout all of that. I'm sure most of you know me as the host of the Data Engineering Podcast, but just to give my official introduction, which has probably leaked out in bits and pieces over the various episodes over the past five years: I am the host of this show, and I've been running the Data Engineering Podcast since 2017.
Before that, I started the show Podcast.__init__, focused on Python and its ecosystem. And my background is actually in a combination of systems administration and software engineering. So I got my degree in computer engineering, which is sort of a hybrid of electronics and software, so electrical engineering meets computer science. And I started as a systems administrator and then moved into software engineering and then ended up settling kind of in the midpoint as a DevOps engineer. These days, I run the platform and DevOps team for Open Learning at MIT, which gives me an opportunity to work at the boundaries of software and systems and data. And so I'm actually working through the process of building out my own data platform and using a lot of the lessons that I've learned from the podcast and from my guests to make some architectural choices about how that will best work given our constraints and our operating environment and some of the goals that we have for that platform. So this podcast has served as a valuable learning experience for myself, as I know it has for many of the people who listen to it.
And so my background in data management is actually through that work of being a systems administrator and software engineer and managing a lot of the data and persistence mechanisms that go into these various systems that I've worked on. Some of the most data-heavy work that I've done was actually for a job where I was doing a lot of work with BigQuery, trying to use it as the persistence mechanism for a clickstream data flow, basically, where we had a homegrown API server that would take JavaScript events from a web application, push them into a Redis queue, and then batch those up into BigQuery so that we could then build the user-facing analytics on the events that we were tracking. And so it really wasn't the best tool for the job, but we were able to make it mostly work. So a lot of the things that I learned about data came from that, as far as useful patterns, antipatterns, and choosing the right storage engine for the job. So definitely some useful lessons learned there. And that's also part of what fed into my interest in data engineering as a formal position and role, and a lot of the computer science principles that go into it.

And so as I mentioned, I've been running the podcast for about five years now, and I think it was maybe two or three years ago at this point, so I had been running the show for about three years, that I was approached by the folks at O'Reilly to see if I was interested in taking over a project that they were doing called 97 Things Every Data Engineer Should Know. And so that was actually an entry in the series that they've done with that general theme of "97 things every blank should know". So they've had programmers, cloud engineers, and in this case, data engineers. So they asked me if I was interested, and it seemed to fit well with what I was doing with the podcast, so I took on the role. And the goal of the book is to be able to give people a good overview of some of the lessons and principles that go into data engineering and some of the things that you need to know as you're coming into the role, or, if you are already working as a data engineer, maybe some lessons that you haven't had the opportunity to come across on your own. And so it's a collection of short essays by folks who are working in the industry in various roles, sharing some of the tidbits that they've learned and trying to present it in a concise format to give you the idea, but not going deep. So each article is about a page and a half to two pages, and it covers various macroscopic and microscopic trends in the industry or details that you might want to know. And so I was working on communicating with folks who I've had on the podcast, asking if they were interested. I made some announcements both through the podcast and the email newsletter about the fact that I was working on this project to invite people to send me their essays to include.
And so it was about a year-long project of communicating with people, collecting all these entries, and then reading them, sorting through them, figuring out which ones really kind of capture the essence of data engineering without having too much overlap. So it was an interesting project, and I learned a lot of things in that process as well. And the finished book actually got released in the middle of fall of last year, so it's been out for a little while now. It's been getting some pretty good reception. I've had some folks reach out to me to say that they've read it and enjoyed it and been able to learn some things from it. And I just wanted to kind of recap and review some of the details that went into the book and some of the interesting lessons and trends and themes that I noticed as a result of that.
And so starting out with the question of what is a data engineer and how do you get into data engineering, some of the things that go into the book, and also things that I've learned from the podcast, are that there isn't any one path into data engineering, particularly a few years ago. These days, it's becoming a bit more direct, where there are actually boot camps where folks will train you in some of the principles of data engineering, because of the fact that there is such a growing need for people to be able to manage the wealth of data that organizations are collecting and trying to use for analytics and machine learning projects. And so there are some more formalized ideas of what data engineering requires. But in terms of the backgrounds of people who get into data engineering, it's everything from software engineers who are really interested in the data processing and the analytical aspects of it, so they want to get more into the systems level. There are infrastructure engineers who end up being tasked with handling some of the underlying storage engines and processing engines: people who came from the Hadoop ecosystem of needing to be able to deploy and maintain these large clusters of machines, or people who are working with Kafka or even just relational database engines like Postgres.
And they are really interested in the reliability aspects of building these infrastructure components that power the analytical capabilities for their organizations. And so they decide that they want to move a layer up from the bare metal and the deployment of those systems into the actual operation of those systems. There are also a lot of folks who get into it from the data science side, where they maybe come in as the first data scientist on a team or in an organization, and so they then end up having to build up all of the processing and cleaning before they can do what they were hired to do. And then there are also a number of data analysts who maybe start with the analytical knowledge of how the data is being used and then decide that they wanna get more into the details of how that data is collected and managed and cleaned.
And so there isn't any one path into data engineering, and even more so than with software, it seems like folks are coming from a number of different backgrounds, because data is very pervasive. And even if you're not dealing with data from a computational perspective, there is still a lot of interest and draw for folks who are curious about how the data can be used and what they can do with the data, and so they get pulled into this area. Some of the ideas that come into the book about getting into data engineering are some of the categorizations of types of data engineers. So there's one article that draws a distinction between the data engineer who works at a higher level of understanding what questions are being asked of the data, treating everything as a declarative recipe for getting at that information, cleaning it up, and preparing the data, and then the software-focused data engineer, where you're building more complex systems, doing detailed processing of the information, maybe feeding that data into other downstream systems, and building out these complex pipelines to be able to work with machine learning use cases or analytical use cases. And so there are different requirements and backgrounds that feed into both of those, so that was a very interesting way of kind of dividing the types of data engineers.
Some of the other things that go into the kind of career aspect of working in data engineering is the idea of treating the data as the focal point of the process and not trying to put as much emphasis on the actual technical and software components of the system. So working with the other folks on your team and across your organization to help them understand all of the different ways that the data is being used and the different processing that's happening on it, and feeding back some of the usage patterns, exposing that so that people can see, oh, okay, this person in this department is using this set of tables to be able to answer their questions, maybe I can take advantage of that. So building a community in your organization around the data and how it's used, and not spending so much of your emphasis on the technical elements of what is doing the processing and how the data is being provided. Another aspect of data engineering and the data ecosystem that is, I think, unique in the space is that there are a lot more cross-cutting concerns that go into it. Working as a software engineer, you are building an application, and so there's a much more contained aspect to it, where you might have a product team that provides you with the requirements or the feature requests.
But as a software engineer, you can live in this entire ecosystem of the application. You know, with microservices, it's maybe multiple applications, but it's still a software system. With data, you need to have folks who are spanning the technical elements of how is the data generated, how is it collected, how is it stored, how is it processed, how is it managed, and then the analysts who are trying to understand the context around the data. So how did this data get produced? Who produced it? When was it produced? Why was it produced? What is the goal of this dataset? How can I use this to answer questions that the business people are asking me? And then you have the people in the business, whether that's the C-suite executives who are trying to figure out what direction to take the business, or salespeople who are trying to understand what are the patterns of our industry and how can I understand more about the customers that I'm working with, and similarly with marketing. And so you have all of these people that are all oriented around the data, because of the fact that so many organizations rely on this to be able to know what is actually happening in the business, because businesses are becoming more complex and more multifaceted. There's more data available to work with, and so everybody needs to have their hands in this process.

And so as data engineers, we need to be able to provide the information that these people need beyond just the raw data points. We need to be building an ecosystem of the data and the context, and helping people be able to answer these meta questions beyond just how many widgets did I sell this past quarter, but who did I sell them to, what were the sort of cycles of sales, what were some of the other kind of macroscopic and macroeconomic elements that were going on. So there's no real stopping point with data. Sometimes with software systems, a problem can be well scoped and you say, okay, I've written all the software, this project is done. You know, it does everything it's supposed to do, and barring bit rot, there's not really anything that needs to be done about it anymore. With data, there are always gonna be additional follow-on questions. So there are always gonna be opportunities to bring in additional data sources, add additional context, enrich the data, find new ways of accessing the data. So that's another thing: you might put your data into a data warehouse because that serves the need of your analysts, but then maybe for your machine learning engineers, you need to also have it available in a data lake to be able to pull in unstructured data, or maybe you need to merge data across multiple different storage locations. So there are a lot of complexities that come into the space.
And so moving into some of the macroscopic elements of data engineering and the data industry, some of the things that are discussed in the book are the continued dichotomy of batch systems and streaming systems, where for a long time we didn't have a lot of the capacity to do large-scale streaming analytics because the technology wasn't there yet, but that is increasingly not the case, where we have a number of different streaming engines. But batch systems are still much easier to reason about; they're more intuitive to think about how they work. And so there is still this trade-off of complexity in terms of the technologies and in terms of the paradigms: do I want to build this in a batch mechanism where maybe I'm gonna put everything into the data warehouse, or do I want to do this in a streaming approach where I want to be able to have continually updated, real-time information about a particular aspect of the business or the customer engagement?
And so there are some articles that talk about when to use batch and when to use streaming. There are articles that dig into some of the specifics of streaming messaging patterns, so talking about the data contracts and making sure that you have schematized elements so that everybody who is working with the data knows what the structure is going to look like and knows how they're able to use it, and then being able to have that data land in a data lake or a data warehouse and have the appropriate structure around it. Because data lakes are definitely very useful because they can be very flexible, but that flexibility also adds a certain amount of extra upfront requirements to make sure that you know what the data is going to be used for and why, so that it can be structured appropriately. Otherwise, it becomes a dumping ground and becomes useless. So you need to have data cataloging. That's another macroscopic trend that's been coming up a lot in recent years: building out data catalogs and building out metadata management so that there's this discoverability and visibility element of the data that you're working with, and that actually spans data lakes and data warehouses.
On the batch side of things, the cloud data warehouses have been seeing a lot of activity and attention because of the expanded capabilities and because of the additional flexibility in terms of the processing and the cost models, where data warehouses used to be these large appliances that were put into a data center, and you had to understand what is my capacity going to be for the next five years so that I know I'm buying enough hardware to handle the maximum use case for however long this is going to be in service. Whereas now, it's a pay-as-you-go model where you can say, I'm going to start with a small data mart, maybe I'm going to use a Snowflake or a Redshift or a BigQuery, and I can add capacity dynamically as I go. I don't need to pre-provision all of that.
There's actually a great article that talks about the contributing factors for understanding what data warehouse to use. So what are you going to need it for? How long is it going to be in service? Because there is a substantial switching cost if you say, okay, today I'm going to try out Snowflake, and then six months from now I realize that it's actually not the right fit for my business and now I need to migrate to BigQuery. So there's a substantial cost both in terms of engineering time but also in terms of the actual financial cost of migrating your data between these systems. So one of the main themes that I've seen both in the book and in the podcast is really this importance of having a good upfront understanding of the use cases that you're trying to power and the design that will provide that. So with software systems such as a web application, you could be very iterative in the development, where you say, okay, I'm going to start with this one form that I can use as input, and then I'm going to add additional functionality, where next week maybe I'll add a dashboard that shows all the inputs to this form. Whereas when you're working with large data systems and complex data systems, you need to have a much more detailed understanding of what is the data that I'm collecting, who is going to be using that data, what transformations or processing do I need to do on that data, and what is the format and the storage location?
What are the boundaries in terms of the technical as well as the organizational conditions of when this information is going to be moving either between systems or between teams? You need to have these well-defined specifications of how the data is going to be stored and used so that you don't end up having to re-engineer things and spend a lot of time and effort after you've made the discovery. Maybe you say, I'm going to put everything into a data lake as a bunch of JSON files, and I'm not going to enforce the structure upfront because I'm still exploring the problem space. And that might be okay for a very small, well-scoped discovery period, but if you let that go on too long, then you're going to end up having to spend a lot of time and effort once you do understand what your use cases are going to be, writing pipelines that will actually reprocess all of that data, figure out how to handle mismatched schemas and mismatched data types, and do a lot of extra cleaning that could have been avoided if you had spent a bit more time at the beginning having conversations with people in the business and people on your team about what the use cases are that you're trying to power. So there's a lot of importance in this very deliberate approach to designing and implementing your systems. It's possible to do it iteratively and ad hoc, but it's going to cost you a lot of extra time down the road.
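To make that point about enforcing structure upfront a bit more concrete, here is a minimal sketch of validating events against a declared schema before they land in the lake. This is my own illustration rather than anything prescribed in the book; the event shape, field names, and file paths are hypothetical, and it assumes the jsonschema library is available.

```python
# Minimal sketch: validate incoming JSON events before landing them in the lake.
# The schema, field names, and paths are hypothetical examples.
import json
from jsonschema import ValidationError, validate

ORDER_EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "amount_cents": {"type": "integer", "minimum": 0},
        "created_at": {"type": "string"},
    },
    "required": ["order_id", "customer_id", "amount_cents", "created_at"],
    "additionalProperties": False,
}

def land_event(raw: str, valid_path: str, quarantine_path: str) -> bool:
    """Validate one raw JSON event and route it to the lake or to quarantine."""
    record = json.loads(raw)
    try:
        validate(instance=record, schema=ORDER_EVENT_SCHEMA)
    except ValidationError as err:
        # Malformed events are quarantined for inspection instead of silently
        # turning the lake into a dumping ground of mismatched shapes.
        with open(quarantine_path, "a") as fh:
            fh.write(json.dumps({"error": err.message, "record": record}) + "\n")
        return False
    with open(valid_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return True
```

The same idea scales up to schema registries and contract tests in real pipelines; the point is simply that structure gets checked at write time rather than discovered painfully at read time.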
StreamSets' DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.
Finally, one single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' Professional Tier will receive two months free after their first month.

Some of the microscopic details, or some of the implementation specifics, that are interesting, and that some useful articles in the book talk about, are the ways that the data is actually laid out on disk and how that can impact processing times and latencies and the efficiency of your workflows. So there are some types of systems that actually do very well with lots of small files, but a lot of times it's actually much better to have a fewer number of larger files. So figuring out what is the appropriate granularity to trade off processing efficiency with latency, or processing efficiency with scale of the data.
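As a rough illustration of that small-files trade-off, here is a sketch of compacting a directory of many small Parquet files into one larger file with pyarrow. The paths and row-group size are assumptions made for the example, not recommendations from the book.

```python
# Rough sketch: compact many small Parquet files into one larger file.
# Paths and the row-group size are illustrative assumptions.
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Treat the directory of small files as one logical dataset and materialize it
# (fine for a sketch; a very large table would be rewritten in chunks instead).
small_files = ds.dataset("landing/events/", format="parquet")
table = small_files.to_table()

# Rewrite as a single larger file, which most query engines scan far more
# efficiently than thousands of tiny files.
pq.write_table(table, "compacted/events/part-0000.parquet", row_group_size=1_000_000)
```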
There are also some useful lessons about patterns to use for working with distributed messaging systems such as Kafka or Pulsar, and how to think about structuring your data streams in a way that downstream consumers can make effective use of them without having to do a bunch of re-engineering of your events. There are also some good lessons about the specifics of different data persistence models, such as relational engines versus non-relational engines, so SQL versus NoSQL if you will, as well as strongly consistent versus eventually consistent systems and what the trade-offs are there. Strong consistency means that I know that when I write this data, I can read it back immediately and know that it will have the same information. Whereas if I have a system that has a high volume of writes, I need to make sure that I can accept all of those writes, but I don't necessarily care about being able to immediately read those back consistently. I'm okay with them being eventually consistent; I know that eventually everything will coalesce to a steady state, but at the point where I'm writing all of this data, I'm okay with having to do some reconciliation afterwards. There are also some useful articles talking about some of the considerations that go into defining and implementing these data contracts that I was referring to, with the upfront design that's necessary.
So one of them actually has some very useful advice about using a versioned schema for being able to enforce those contracts. So using something such as Avro or protocol buffers in the producing system to ensure that events are only ever able to be emitted in a well-formed and schema-defined manner, and being able to have those schemas be evolvable so that you can ensure backwards compatibility with older events as you're working through these systems.

Another interesting point, both a microscopic detail in terms of how it's implemented and a macroscopic trend that I'm seeing, is the recognition that building manageable and scalable data systems is the responsibility of not just the data engineers, but also application engineers. So starting to push some of these principles of well-structured events and properly formatted data into the application layer, so that as data engineers we are not tasked with getting a direct connection to an application database, trying to figure out what all these tables are supposed to be telling me, and understanding how those tables might be related to each other, particularly if you're dealing with an application that's using an object relational mapper that might not actually create the appropriate foreign key references in the database schema because they're implied in the code. So rather than having to dig into the guts of these databases and manage the schema evolutions that the application engineers need, but that you don't necessarily understand as the consumer of that data (why did this table lose a column, why was this column renamed, or what is the new requirement of whether this column is null or not null or sometimes null), you can instead say to the application developers that part of the requirements of your application is that you are going to provide me with an interface to consume the data that I should care about. So I don't know what's going into this application, I don't necessarily know how it works, but you as the application engineer understand what are the important pieces of information and what are the ways that this data might be useful outside of the context of the application, and so you provide an interface to be able to consume that data in a stable manner. So using these protocol buffers or Avro schemas to create these APIs so that I can just query a stable endpoint to get this data out, maybe using something like a Singer tap or an Airbyte consumer to be able to pull data out of this application without having to go into the database layer to get it. So I think that that's really a valuable evolution of the industry, where data is becoming a first-class consideration, where analytical uses of data are becoming a first-class consideration in the structure of the applications that we build, so that we don't have to do this re-engineering every time we wanna pull something out of some system that produces the data in the first place.
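To circle back to that versioned-schema point, here is a small sketch of what an evolvable Avro schema can look like, using the fastavro library. The record and field names are made up for illustration, and this is just one way to express the pattern, not the specific approach any particular essay prescribes.

```python
# Sketch of an evolvable, versioned Avro schema for a data contract.
# Record and field names are hypothetical.
import io
from fastavro import parse_schema, reader, writer

# Version 1 of the contract: what the producer originally emits.
SCHEMA_V1 = parse_schema({
    "name": "PageView",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
    ],
})

# Version 2 adds an optional field with a default, so data written under v1
# remains readable by consumers expecting v2 (backwards-compatible evolution).
SCHEMA_V2 = parse_schema({
    "name": "PageView",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
writer(buf, SCHEMA_V1, [{"user_id": "abc123", "url": "/home"}])
buf.seek(0)

# Reading old records with the new reader schema fills in the default value.
for record in reader(buf, reader_schema=SCHEMA_V2):
    print(record)  # {'user_id': 'abc123', 'url': '/home', 'referrer': None}
```

Protocol buffers express the same idea with numbered and reserved fields; the common thread is that compatibility is enforced by the schema system rather than by hoping downstream consumers notice the change.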
Another element of the kind of data contracts and the evolution of the ways that data is transformed and produced is the data mesh. So there have been a number of conversations about data mesh as a way to say: this is the bounded context of where this data is being produced, this is the contract that I'm going to provide of what this data looks like and some of the semantics around it, so that downstream consumers can use it without having to do a bunch of reprocessing. So I think that that's another interesting and useful evolution of the industry, and some things that are covered in the book as well. And I think that that also factors into a few of the organizational trends, where there's the growth of analytics engineers as a recognized role, where they're working more closely with the business to be able to produce these analyses, but also being responsible for handling the cleaning and transformations of the data so that they can access it in the ways that are most conducive for answering the questions that they're being asked, as well as the growth of machine learning engineers as people who are dedicated to building the machine learning operations and making these machine learning workflows more repeatable. And just in general, there's been a lot more specialization because of the growth of data. But at the same time, as we get more sophisticated, I'm seeing a bit of a reverse trend also, where some organizations say we don't actually need specialized definitions of roles for data infrastructure engineer, data engineer, machine learning engineer, MLOps engineer, analytics engineer; we just need engineers who understand data, and so it's becoming maybe a bit more homogeneous.
So there's kind of this interesting dichotomy of specialization and generalization, and it really depends on where you are in your career, where your organization is in its maturity level, and some of the ways that you're using data. So I've definitely seen folks who are starting to move in the opposite direction of saying, we just have engineers, it's everybody's job to care about the data, and that's part of what the data mesh trend is moving towards. But also people who are saying, we're using data all over the place, so we need analytics engineers and MLOps engineers and data engineers, etcetera, and data platform engineers and data infrastructure engineers.
So, interesting times, to be sure. And in all of that, one of the things that has remained the same and continued to gain in importance is actually the role of automation in making all of this manageable, where we have data pipelines that are tasked with automating, you know, the extract and the loading and the transformation of data. There is automation at the infrastructure layer for being able to dynamically scale capacity, whether that's through a vendor such as Snowflake, or if you're using infrastructure as code for your cloud resources, or being able to manage public cloud versus private cloud versus hybrid cloud. So it's really a rapidly expanding and constantly fractal space that we're working in.
And so there's definitely a lot to know, but at the same time, you can get a lot done with knowing just enough. So I definitely don't want folks to be overwhelmed with everything that's happening and feeling like you're never going to know everything, because none of us know everything. Nobody's ever going to know all there is to know about data. But as long as you're able to grasp the kind of foundational and fundamental principles, you'll be able to figure out the rest of it as you go. So you really just need to say to yourself, what is it that will help me today or tomorrow with being able to be more effective at handling this one piece of the way that data is used, either for my own personal projects or in my organization, and then use that as an opportunity to learn a little bit more. So definitely don't try to learn everything all at once, but instead try to practice the principle of just-in-time learning: I know that I need to get this done today or tomorrow, so I'm going to learn a bit more about this. And then that will open up new avenues to explore as you go, and you'll always be able to make forward progress.
In terms of my work on the podcast and on the book, I started this podcast partly because I just wanted to learn more about the space. So I, as I said at the beginning, have experience in software development, systems automation, and systems administration. And I knew some of the principles of data engineering because I had built some pipelines, but I really wanted to have the opportunity to learn from the experts and help share that knowledge through the podcast. And so I've had a really great opportunity to be able to make that happen. And in the past five years, I've gained a lot of knowledge personally. And as a result, I've started to be viewed as an expert in the field because of the podcast.
And so some of the most interesting and innovative and unexpected outcomes that I've seen from running the podcast and working on the book, and some of the feedback that I've gotten, is that I never thought that I would be this far along in my journey as a data engineer, that I would have been able to provide so much information and knowledge to the community, or that the podcast would ever get to be as popular as it is now. I've actually had some folks who have written to me to say that they were interested in data engineering, and through listening to the podcast and learning the lessons that way, they were actually able to break into the industry and get a job as a data engineer because of the things that they learned through the podcast. I've also had people write to tell me that they've been working in data engineering, or maybe they manage a data team, and because of the things that they've heard about in the podcast or lessons that they've learned, they've been able to make substantial improvements to the way that they manage data in their organization. So it's definitely very humbling and gratifying to have been able to provide such a resource to so many people. In terms of the interesting and unexpected and challenging lessons that I've learned while working on the podcast and the 97 Things book, I didn't know at the outset how much detail there is and how wide this ecosystem can be, with so many different considerations ranging from, you know, just the pipeline design, ETL, databases, data lakes, into metadata management, data governance, data privacy, and data security.
So I've definitely learned a lot there, and I've also learned a lot about how to evaluate the utility of different products or services that are out there, because I've had so many people reach out to me saying that they wanna be on the show to talk about various things, or I've come across different things in my own work and tried to understand, is this useful, or maybe this is something that's worth exploring on the podcast. So it's really about being able to have a useful application of judicious skepticism: not taking claims at face value, but knowing how to look to see how well this particular vendor or product is addressing these fundamental elements of data challenges or organizational challenges.
And when is it something new and novel versus when is it just a repackaging of something that's already been done before? So how do I figure out whether to talk about five different data quality tools because they're all taking it in a different direction, versus only talking to one or two of them because there are only one or two kind of novel ideas in the space? There are definitely a lot of novel ideas in the space of data quality, so I don't wanna suggest that it's not nuanced and detailed, but I'm just using that as a sort of top-of-mind example. And I was also very grateful for the opportunity to work on the 97 Things book, both as a way to have the opportunity to reach out to folks in the community to get some more detail from them about things that they're working on, as well as a way to provide a resource that people can use to get a broad, surface-level view of a lot of the things that are happening in the space.
So for anybody who is interested in breaking into the industry or learning more about some of the foundational principles or figuring out what are the topics that are worth exploring, I think the book does a good job of that. And I'll also say that if you really wanna get a solid foundational introduction to a lot of the computer science and systems design principles that go into all the systems that we work on, I highly recommend reading the book Designing Data-Intensive Applications, because it is a fantastic resource that has an appropriate level of detail on all of the different computer science and distributed systems concepts that go into the things that we're working on.
There are lots of other great resources out there. So there are some useful Medium blogs. I definitely recommend checking out what the folks at Data Council are doing. They've got a great community. You know, they do a lot to help further the ecosystem and work with both open source projects and newcomers to the community, as well as helping businesses kind of reach their audience and understand what are the challenges that people in the trenches are going through every day. And there are also some other communities that are starting to grow up. There's one that's growing up around data mesh; I think it's called Data Mesh Learning.
There's an MLOps Community that's been gaining a lot of ground. So I'll add links to all these in the show notes. And so I definitely just wanna thank everybody for giving me your attention and taking the time to listen to the shows that I put out. I'm very happy that I've been able to provide something that is useful. And so with that, I'm going to add my contact info to the show notes. It's already on the website, but that's for folks who maybe wanna follow up. Definitely, if you liked this format of having me go on a monologue and talk about some of the things that I've been seeing, let me know. If there are any particular topics that you want me to dig deep into from what I've seen or what I've worked on, just send me a message. I'm happy to try this out again. And so as my final question to myself, I'd like to share my perspective on what I see as being the biggest gap in the tooling or technology that's available for data management today. And I think that right now, it's actually in this space that I mentioned earlier of the kind of split between software and application development and data engineering and data platforms, where we're starting to evolve to this space where data and stable interfaces are a first-class consideration of applications, but the tooling is not quite there to make it easy for application developers to be able to say, okay, this is what I want to expose. We have good resources to define database models or to define APIs, but it's not straightforward and out of the box to say, okay, these are the things that you need to know about how to build an API that is useful for pulling out analytical data, whether that's in batch format, or being able to do event publication so that I can maybe feed that into an event bus, or being able to do some, like, change data capture style approach from that stable interface, so that I can just, as a data engineer or as a data platform, consume those events incrementally.
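To give a flavor of what such a stable, analytics-friendly interface could look like, here is a purely hypothetical sketch of an application exposing its changes incrementally, keyed off an update timestamp, so the data platform never has to reach into the application's private schema. The table, columns, and consuming helpers are all invented for the example.

```python
# Hypothetical sketch: an application-owned, incremental extraction interface.
# Table name, columns, and the consuming helpers are invented for illustration.
import sqlite3
from typing import Dict, Iterator

def changed_orders(db_path: str, cursor_value: str, batch_size: int = 500) -> Iterator[Dict]:
    """Yield orders updated after the given cursor value, oldest first.

    The consuming pipeline remembers the last `updated_at` it saw and passes it
    back on the next run, so extraction is incremental rather than a full-table
    scan against whatever the application happens to store internally.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(
            "SELECT order_id, status, amount_cents, updated_at "
            "FROM orders WHERE updated_at > ? ORDER BY updated_at LIMIT ?",
            (cursor_value, batch_size),
        )
        for row in rows:
            yield dict(row)
    finally:
        conn.close()

# A consuming pipeline might then do something like (helpers are hypothetical):
#   last_cursor = load_saved_cursor()
#   for record in changed_orders("app.db", last_cursor):
#       emit_to_warehouse(record)
#       last_cursor = record["updated_at"]
#   save_cursor(last_cursor)
```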
So there's definitely a lot of opportunity to be able to build tooling and systems that make it easier for people who are working with these, you know, web frameworks or application frameworks to create these analytical interfaces without having to re-engineer it from scratch every time. And so with that, I definitely wanna thank everybody for listening. I'll add some links to the show notes with useful references. I appreciate everything that everybody has done to help get this show to where it is today. Thank you, and have a great rest of your day. Thanks for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Background
Lessons from Running the Podcast
97 Things Every Data Engineer Should Know
Pathways into Data Engineering
Batch vs. Streaming Systems
Data Catalogs and Metadata Management
Importance of Upfront Design
Implementation Specifics and Data Contracts
Data Mesh and Organizational Trends
Role of Automation in Data Management
Unexpected Lessons from the Podcast and Book
Recommended Resources
Biggest Gap in Data Management Tooling
Closing Remarks