In this solo episode of the Data Engineering Podcast, host Tobias Macey reflects on how AI has transformed the practice and pace of data engineering over time. Starting from its origins in the Hadoop and cloud warehouse era, he explores the discipline's evolution through ML engineering and MLOps to today's blended boundaries between data, ML, and AI engineering. The episode covers how unstructured data is becoming more prominent, vectors and knowledge graphs are emerging as key components, and reliability expectations are changing due to interactive user-facing AI. The host also delves into process changes, including tighter collaboration, faster dataset onboarding, new governance and access controls, and the importance of treating experimentation and evaluation as fundamental testing practices.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
- Your host is Tobias Macey and today I'm reflecting on the increasingly blurry boundaries between data engineering and AI engineering
- Introduction
- I started this podcast in 2017, right when the term "Data Engineer" was becoming widely used for a specific job title with a reasonably well-understood set of responsibilities. This was in response to the massive hype around "data science" and consequent hiring sprees that characterized the mid-2000s to mid-2010s. The introduction of generative AI and AI Engineering to the technical ecosystem is changing the scope of responsibilities for data engineers and other data practitioners. Of note is the fact that:
- AI models can be used to process unstructured data sources into structured data assets
- AI applications require new types of data assets
- The SLAs for data assets related to AI serving are different from BI/warehouse use cases
- The technology stacks for AI applications aren't necessarily the same as for analytical data pipelines
- Because everything is so new there is not a lot of prior art, and the prior art that does exist isn't necessarily easy to find because of differences in terminology
- Experimentation has moved from being just an MLOps capability into being a core need for organizations
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL. The result, inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect. You're a developer who wants to innovate. Instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers by developers. MongoDB is ACID compliant, enterprise ready with the capabilities you need to ship AI apps fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads.
Ready to think outside rows and columns? Start building at mongodb.com/build today. Your host is Tobias Macey, and today I'm going to talk about some of the shifts that we've been seeing in recent years as a result of AI and AI engineering practices. I started this podcast back in 2017, which is right at the point that the term data engineering was really becoming widely used and widely adopted. It got attached to a job title, and the responsibilities for that job title were generally pretty well understood. From what I've seen, the creation of data engineering as a specific discipline, and as an outgrowth of things like data warehousing, business intelligence engineers, and DBAs, came about because the early to mid two thousands through until the mid twenty tens was when we really had the massive growth in popularity and hype around the idea of data science and data scientists, and how that was the hot new job of the twenty-first century, and that was what everybody wanted to do.
And so there was a massive hiring spree of large companies and the growth of the Internet era for people who wanted to be able to collect and make sense of all of the data that was available to be had because of the growth of Internet usage. And they wanted to be able to turn that raw information into valuable and actionable insights and assets that they could then generate new revenue streams from. And a lot of the people who got hired for that role of data science and data scientists and machine learning engineers didn't actually have the data that they needed when they started in those positions. And so rather than coming into a company, building a model, building a forecast, making predictions, and automatically being able to drive a huge amount of revenue for the company, they instead had to do all of that hard work and heavy lifting of finding the data that existed, understanding its context, understanding how it fit together, collecting it, cleaning it before they could even do the work that they are hired to do. And so that was what really led to the growth of data engineers as a job title as it was a means of being able to differentiate those two areas of work and capitalize on the investments of these data scientists and the statistical and modeling and machine learning skills that they had by bringing in people who understood how to work with data, how to make it reliable, how to clean it, how to present it to the teams that then wanted to take action on it.
And the data engineers also pretty quickly started to take over a lot of the business intelligence work as well with caveats. And so that, I think, too was also what led to some of that analytics engineer as a role where you didn't want to necessarily have your data engineers spending all of their time building reports, working with the business stakeholders because then they weren't doing the job that they were hired to do. And so analytics engineers came in, we started to have this fracturing of the titles and the different business roles. And so we kept hiring more people who are working with data in different avenues, and there were fairly clear delineations between who was doing what at what points in the overall data life cycle.
Around when I started this podcast, it was also on the tail end of the Hadoop era where the whole idea was, hey, we'll just collect all of the data, we'll throw it into these massive data lakes, we'll do some fancy MapReduce on it, and eventually we'll get some value out of that data. And so a lot of companies invested a substantial amount into those infrastructures and into that architecture, and it didn't ultimately pan out for a number of them. And so technologies such as Redshift came to the forefront, and since then Snowflake, ClickHouse, and a lot of these cloud data warehouses and columnar engines came in to give us the ability to collect a lot of the data, but keep it structured and operate on it. And a lot of the work that data engineers would do came to be filling the data warehouse with all of that structured and semi structured data, making it reliable, making it repeatable, making sure that we had ways of bringing that data in and doing some activity on it, and then providing clean interfaces for downstream consumers of that data, whether that was the business intelligence or the machine learning engineers.
In the late twenty tens was when we really started to have that uptick in deep learning, which was what then led to the idea of ML ops and operationalizing these machine learning workflows because it became more practical and achievable to be able to build useful models of the data without necessarily having to do as much upfront work and experimentation. Obviously, it was still a very core piece of that workflow, but you didn't necessarily have to do every little bit of fine tuning of the features that you were feeding in because the deep learning algorithms could pick out some of those patterns for you.
And so we had the growth of machine learning engineer and ML ops as a role and as a set of technologies. And so many of those were an outgrowth of the existing data engineering technologies and orchestration workflows, but also many of them were net new and built specifically for those deep learning and ML use cases. Now over the past two years in particular with the growth of generative AI and large language models, a lot of those distinctions and differentiations have really been blurred. About two years ago, I started the AI engineering podcast because of this growth in this new style of work and set of requirements, and I've largely kept them as distinct shows. But the more time goes by and the more that adoption of these capabilities and technologies grow, the harder it is to really differentiate between is this a data engineering topic, is this an AI engineering topic? Because in many cases it's both where the data that we need is something that you have to prepare for use by the AI models. Many of the AI models can be used in preparation of the data, particularly when we talk about things like unstructured and even semi structured data where for a long time PDFs, large free text documents, audio, even things like images and video were stored and you could do some metadata extraction and gain some insight about it. But the actual content of that information was largely put to the side and not used in the day to day of data engineering workflows.
Now that we have these models that are capable of processing larger chunks of that raw data and being able to extract meaning and semantics and pieces of detail from it, we're at a point where we now have to start thinking about, okay, what are the pieces of unstructured data that are going to be useful? What are the common factors that I want to extract out of it? How do I turn this into something that can be stored in a data warehouse or adjacent to a data warehouse? So that's one of the major ways that data engineering has been shifting is that we need to bring in some of these language models and other probabilistic technologies into what has been a fairly deterministic workflow.
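As a rough illustration of that kind of extraction step, a pipeline might prompt a model to emit structured JSON from free text and validate the result before loading it into the warehouse. This is a minimal sketch under stated assumptions: `call_llm` is a hypothetical stand-in for whichever model API your stack uses, stubbed here with a canned response so the example runs offline.

```python
import json

# Hypothetical placeholder for a real model call (OpenAI, Anthropic, a local
# model, etc.); stubbed with a canned JSON response so this sketch is runnable.
def call_llm(prompt: str) -> str:
    return json.dumps({
        "sentiment": "negative",
        "topic": "billing",
        "summary": "Customer disputes a duplicate charge.",
    })

REQUIRED_FIELDS = {"sentiment", "topic", "summary"}

def extract_ticket_fields(ticket_text: str) -> dict:
    """Turn a free-text support ticket into a structured row for the warehouse."""
    prompt = (
        "Extract the following fields from this support ticket as JSON with "
        f"keys {sorted(REQUIRED_FIELDS)}:\n\n{ticket_text}"
    )
    raw = call_llm(prompt)
    record = json.loads(raw)  # model output is probabilistic: always validate
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"model response missing fields: {missing}")
    return record

row = extract_ticket_fields("I was charged twice this month and nobody answers my emails.")
```

The validation step is the part that carries over from existing data engineering practice: because the model is probabilistic, every record it emits gets schema-checked before it touches downstream tables.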
Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud. Another aspect of the ways that these language models and generative systems have changed the work of data engineers is that we have new types of data assets that we're responsible for, with vector databases and vectors being the biggest example, where before we would have tabular data, you would maybe do some feature engineering on that and then send that to the training and maybe hydration of a machine learning model for being able to create some prediction or perform some action.
But now we need to take structured, semi structured, and unstructured data, turn them into those vector embeddings, and then store them in a relatively new technology for many in the form of these vector databases so that the models can retrieve that at inference time and be able to do it quickly and effectively. And so that still requires a lot of the skills that we have as data engineers, but the requirements around the ways that we're structuring that data are changing. Another important aspect of what's really changing is that unless you were working in an organization that relies heavily on real time streaming data, the SLA, or service level agreement, around the timeliness and reliability of that data has changed a lot. And the uptime in particular of that data has changed a lot, where if you were building something for a data warehouse that was feeding into a business intelligence system, there's a pretty high probability that if it goes down for 15 minutes or an hour in the middle of the night while you reload the data warehouse or update data or fix some bugs, it's not gonna be that big of a deal. But if you are running a vector store and it is powering a customer facing LLM that is doing inference and providing an interactive use case, that same downtime is a much bigger deal. And so then you get into some of the operational characteristics of the systems that we're building, where it's not just a one way flow of information; we need to be able to actually start generating new information and new insights from those interactions as well, which is where things such as memory stores come in. The use of language models in particular, and ideas such as retrieval augmented generation, have also driven a renewed interest in graph technologies because of being able to build knowledge graphs and semantic graphs of the context.
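To make the embed-store-retrieve idea concrete, here is a toy in-memory sketch. The hashing-based bag-of-words embedding is purely a stand-in for a real embedding model, and the store's interface is illustrative rather than any particular product's API.

```python
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy bag-of-words hashing embedding; a real pipeline would call an
    # embedding model here. This stand-in keeps the sketch self-contained.
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """Minimal in-memory stand-in for a vector database."""

    def __init__(self):
        self._rows = []  # list of (document, embedding) pairs

    def add(self, doc: str) -> None:
        # Embedding happens at write time, just like an ingestion pipeline
        # that populates a vector store ahead of inference.
        self._rows.append((doc, embed(doc)))

    def search(self, query: str, k: int = 1) -> list[str]:
        # Rank stored documents by cosine similarity to the query embedding
        # (vectors are already normalized, so a dot product suffices).
        q = embed(query)
        scored = sorted(
            self._rows,
            key=lambda row: -sum(a * b for a, b in zip(q, row[1])),
        )
        return [doc for doc, _ in scored[:k]]

store = VectorStore()
store.add("refund policy for billing questions")
store.add("how to reset your password")
top = store.search("billing refund", k=1)
```

The shape of the work should look familiar: ingestion, transformation, and serving are all still there, but the transformation step now produces embeddings rather than cleaned columns.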
And so at the core, the purpose of the work that we do as data engineers hasn't substantially shifted, but the shape that it has taken has where really at the base level, the purpose of data engineering is to turn raw information into useful knowledge. But the context in which that knowledge is being exposed and utilized has changed. It has also started to blur the lines of the responsibilities where you don't necessarily have a data engineer who hands off to an analytics engineer or a data engineer who hands off to a machine learning engineer who then maybe hands their model off to an application engineer to incorporate it.
Those teams all have to work much more closely together to be able to build useful products that can go from raw data to useful inference as quickly as possible, because the capabilities of these models keep changing, the use cases for these models keep changing, and the pace at which we need to deliver is changing because these AI models also have a substantial impact on our ability to be productive, deliver quickly, and iterate quickly. In particular, that's because of their ability to generate code as well as generate new insights, and also because it shortens the cycle from a business stakeholder or a customer having an idea or asking a question to getting a useful answer. Before, if you didn't already have a report to be able to answer a given question, even if you had the raw data somewhere, that business stakeholder would have to file a ticket or ask an analytics engineer to say, hey, I really wanna understand what are the types of feedback that I'm getting from my customers.
Maybe that even required building a new model to be able to parse that data in natural language and extract the sentiments of the feedback, where now a lot of that natural language processing can happen with these language models. And so it's really just a matter of, let me take all of this unstructured feedback from customers, whether that's from things like Zendesk or audio recordings, and you can run that through and be able to actually get an answer pretty quickly. And it might not even require the involvement of a business analyst to be able to provide that response because the language model can take on some of that workflow. And so as data engineers, we're also seeing a lot of pressure to be able to onboard new datasets faster and make them accessible to these models at a higher rate and in higher volumes because the ability to unlock value from them has dramatically shifted.
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy? DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they turn months long migration nightmares into week long success stories.
Another way that data engineering and AI engineering are starting to blur the lines and become unclear is that a lot of the same terminology is being used, with orchestration being the biggest example, where data orchestration has typically meant things like Dagster, Airflow, and Prefect, where I have my ETL script, I know that each step needs to be able to consume this data and produce this data, and then you run it over and over again. But now orchestration is also being brought in in the context of these agentic AI systems, where I need to orchestrate the execution of that agentic loop. I need to provide data to the agent for it to be able to make decisions. I need to be able to provide ways for the agent to be able to access data, which also dramatically increases the responsibilities around governance and security and access controls.
And so having those two different styles of orchestration can make it challenging to understand which type of orchestration do I need, where does that orchestration live, do I just use my existing ETL orchestrator for some of these agentic use cases, does it have the ability to operate at those speeds, do I maybe use something like a Dagster or a Prefect to execute another orchestrator and maintain its loop as an extension? What are the different styles of loops that these agents need? Do they all have to be running in a tight loop in a single process? Or is this something where an LLM gets called, it produces some output that gets stored as an artifact somewhere, and another step picks it up and runs it through a different LLM with a different set of instructions? So there are really a lot of unknown patterns and a new evolution of capabilities and ways of thinking about these things and decomposing these workflows that some people have experience with, but many of us don't. And so there's really a lack of established practices, which makes this an even more challenging time to be an engineer in this space, but it also provides a lot of opportunity to contribute to the discovery and creation of those useful patterns and to educating each other on how to understand what those responsibilities are, how we handle those interfaces, and how we manage those handoffs between the different stages.
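That artifact-handoff pattern, where one LLM step writes its output somewhere and a later step picks it up, can be sketched as ordinary pipeline steps that an existing batch orchestrator could drive. Everything here is hypothetical and illustrative: the "models" are stubs with canned outputs so the sketch runs without network access.

```python
import json
import tempfile
from pathlib import Path

# Artifact directory standing in for object storage or an orchestrator's
# managed outputs.
ARTIFACTS = Path(tempfile.mkdtemp())

def run_step(name, model_fn, upstream=None):
    """Read the upstream artifact (if any), call the model, persist the result."""
    payload = {}
    if upstream is not None:
        payload = json.loads((ARTIFACTS / f"{upstream}.json").read_text())
    result = model_fn(payload)  # stand-in for an LLM call with its own prompt
    out = ARTIFACTS / f"{name}.json"
    out.write_text(json.dumps(result))
    return out

# Stub "models" with canned outputs; real steps would each prompt a model.
def summarize(_payload):
    return {"summary": "3 tickets about duplicate billing charges"}

def classify(payload):
    return {"action": "escalate", "reason": payload["summary"]}

run_step("summarize", summarize)
final = run_step("classify", classify, upstream="summarize")
```

Because each step is just read-artifact, call-model, write-artifact, the chain can run as separate tasks in a scheduler rather than as one long-lived process, which is one answer to the "which orchestrator owns this loop" question, though a tight interactive agent loop would need a different shape.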
Maybe one of the biggest new requirements, especially for data engineers, is the fact that the way that we test these different workflows has to change where we have things like data quality monitoring, we have unit tests, we have integration tests to make sure that if the logic breaks, if some new data comes in that breaks our prior assumptions, we would react to it and fix it. But as the models start to become more of the core execution either in terms of the actual pipelines themselves or in terms of the serving of what the data is feeding into, everybody in the whole workflow needs to be able to include experimentation and evaluation as that means of building confidence and verifying changes and maintaining functionality.
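A minimal version of such an evaluation harness might look like the following sketch, where the pipeline under test is stubbed with canned answers (in practice it would wrap a model call), and the suite is rerun on every model, prompt, or data change the same way a unit test suite runs on every code change.

```python
# Stub pipeline under test; a real one would wrap an LLM or RAG system.
def pipeline(question: str) -> str:
    canned = {
        "What is our refund window?": "Refunds are accepted within 30 days.",
        "Do we ship internationally?": "Yes, we ship to over 40 countries.",
    }
    return canned.get(question, "I don't know.")

# Each eval case pairs an input with a checker, since probabilistic outputs
# are graded against properties rather than exact strings.
EVAL_CASES = [
    ("What is our refund window?", lambda ans: "30 days" in ans),
    ("Do we ship internationally?", lambda ans: "yes" in ans.lower()),
]

def run_evals(fn) -> float:
    """Run every case against the given pipeline and return the pass rate."""
    passed = sum(1 for question, check in EVAL_CASES if check(fn(question)))
    return passed / len(EVAL_CASES)

score = run_evals(pipeline)
```

Tracking that score across model swaps and prompt edits is what turns experimentation from a one-off activity into the regression safety net the episode describes.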
And so many machine learning teams and AI teams have that experimentation practice, and there are some sets of tooling available to be able to manage that, but the scope and scale of it has definitely dramatically expanded. And so there's a lot of new discovery that has to happen to make sure that we can effectively operationalize these experimentation workflows and make sure that they can be executed rapidly so that we can continue to build and evolve our systems without fear of breakage. Because as new models come out, the behaviors change. As new data comes online, the behaviors change. We need to be able to understand the effect that different instructions will have for different models or even for the same model. So there is a huge number of variables being introduced that didn't exist as prevalently as they do now, where even just going from deterministic software, where maybe I'm building a web application or a mobile application, to data engineering, the dimensionality of the complexity goes up an order of magnitude.
We're now at a stage where we're going another order of magnitude of complexity going from data engineering to AI engineering and building these AI systems. And so we have a lot of the capabilities and the understanding to be able to do that, but we really need to be investing in that evaluation flow as a means of confidence building in order for us to be able to get that flywheel in motion and maintain momentum in order to be able to keep up with the pace of change because it's just going to keep speeding up or at least stay the same as what it is now. It's never going to go back to what it was five years ago. And so as data practitioners, we really need to be thinking about what are the set of skills that I have and how do I apply them to this expanded set of responsibilities and complexity?
How do I collaborate more closely with machine learning teams, with software engineers? How do I use these models to be able to do my own work more effectively? And how do I gain some level of familiarity with what these models can and can't do so that I can be most effective? And so going forward with the podcast, I'm going to be juggling that question of what are those boundaries? When does this make sense to be discussed in a data engineering context versus an AI context? And so as you continue to listen, I'm always open to feedback on what the questions are that you have as somebody working in this space, and how I can best surface and explore some of the changing responsibilities, the changing technologies, and those blurring boundaries. So I'm going to keep digging into this. I'm working in this space every day as well, so I appreciate the complexity and the uncertainty that we are all facing. But it's also a very exciting time to be working in this space because of all these new capabilities, and because with these models, we can move beyond just being plumbers of data to being able to operate at a higher level of abstraction and capability, because many of these models can handle some of the rote and menial work that we've been stuck with for a long time. So going forward, I'm very excited for the technologies that we have built and rely on to continue to evolve and adapt and bring in new capabilities powered by these AI systems, as well as to extend the governance policies and controls so that we can ensure that we're doing this in a safe and deliberate manner.
And I appreciate you taking the time to listen. Going to my usual question of the biggest gap in the tooling or technology for data management today, I really think that it is that set of patterns and practices of how and where to bring AI to bear, and how the ways that we think about data structures and delivery need to evolve to be able to accommodate these new patterns of access. So with that, thank you for taking the time to listen, and I hope you enjoy the rest of your day. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and purpose: AI is reshaping data work
From data science hype to the rise of data engineering
Role specialization: analytics engineers and clearer handoffs
Post‑Hadoop shift to cloud warehouses and reliable pipelines
Deep learning era and the birth of ML engineering and MLOps
Generative AI blurs boundaries between data and AI engineering
Bringing LLMs into deterministic pipelines for unstructured data
New assets: vectors, vector databases, and changing SLAs
Operational demands: uptime, memory stores, and knowledge graphs
Closer collaboration and faster delivery with model‑driven workflows
Two meanings of orchestration: ETL vs. agentic AI loops
Testing evolves: from data quality to experimentation and evals
Skills and mindset: applying strengths to expanded scope
Show direction: navigating blurred boundaries and feedback
Looking ahead: capabilities, governance, and patterns gap
Closing notes and where to find related shows