Tobias Macey

Host of Data Engineering Podcast

Tobias Macey is a dedicated engineer with experience spanning many years and even more domains. He currently manages and leads the Technical Operations team at MIT Open Learning where he designs and builds cloud infrastructure to power online access to education for the global MIT community. He also owns and operates Boundless Notions, LLC where he offers design, review, and implementation advice on data infrastructure and cloud automation.

In addition to the Data Engineering Podcast, he hosts Podcast.__init__ where he explores the universe of ways that the Python language is being used. By applying his experience in building and scaling data infrastructure and processing workflows, he helps the audience explore and understand the challenges inherent to data management.

Tobias Macey has hosted 422 Episodes.

Designing A Non-Relational Database Engine

April 14th, 2024 | 1 hr 16 mins

Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.
Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer

April 7th, 2024 | 56 mins 23 secs

Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

March 31st, 2024 | 50 mins 44 secs

Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+

March 24th, 2024 | 55 mins 39 secs

A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.
Reconciling The Data In Your Databases With Datafold

March 17th, 2024 | 58 mins 14 secs

A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
Version Your Data Lakehouse Like Your Software With Nessie

March 10th, 2024 | 40 mins 55 secs

Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
When And How To Conduct An AI Program

March 3rd, 2024 | 46 mins 25 secs

Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.
Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

February 25th, 2024 | 56 mins

Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.
Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

February 18th, 2024 | 58 mins 46 secs

A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
Data Sharing Across Business And Platform Boundaries

February 11th, 2024 | 59 mins 55 secs

Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Tackling Real Time Streaming Data With SQL Using RisingWave

February 4th, 2024 | 56 mins 55 secs

Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Build A Data Lake For Your Security Logs With Scanner

January 28th, 2024 | 1 hr 2 mins

Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
Modern Customer Data Platform Principles

January 21st, 2024 | 1 hr 1 min

Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
Pushing The Limits Of Scalability And User Experience For Data Processing WIth Jignesh Patel

January 7th, 2024 | 50 mins 26 secs

Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.
Designing Data Platforms For Fintech Companies

December 31st, 2023 | 47 mins 56 secs

Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.
Troubleshooting Kafka In Production

December 24th, 2023 | 1 hr 14 mins

Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.

April 14th, 2024 | 1 hr 16 mins

April 7th, 2024 | 56 mins 23 secs

March 31st, 2024 | 50 mins 44 secs

March 24th, 2024 | 55 mins 39 secs

March 17th, 2024 | 58 mins 14 secs

March 10th, 2024 | 40 mins 55 secs

March 3rd, 2024 | 46 mins 25 secs

February 25th, 2024 | 56 mins

February 18th, 2024 | 58 mins 46 secs

February 11th, 2024 | 59 mins 55 secs

February 4th, 2024 | 56 mins 55 secs

January 28th, 2024 | 1 hr 2 mins

January 21st, 2024 | 1 hr 1 min

January 7th, 2024 | 50 mins 26 secs

December 31st, 2023 | 47 mins 56 secs

December 24th, 2023 | 1 hr 14 mins